The multi-armed bandit 
problem with covariates 

Vianney Perchet and Philippe Rigollet

Université Paris 7 and Princeton University

Abstract. We consider a multi-armed bandit problem in a setting where
each arm produces a noisy reward realization which depends on an observable
random covariate. As opposed to the traditional static multi-armed
bandit problem, this setting allows for dynamically changing rewards that
better describe applications where side information is available. We adopt
a nonparametric model where the expected rewards are smooth functions
of the covariate and where the hardness of the problem is captured by a
margin parameter. To maximize the expected cumulative reward, we introduce
a policy called Adaptively Binned Successive Elimination (ABSE) that
adaptively decomposes the global problem into suitably "localized" static
bandit problems. This policy constructs an adaptive partition using a
variant of the Successive Elimination (SE) policy. Our results include sharper
regret bounds for the SE policy in a static bandit problem and minimax
optimal regret bounds for the ABSE policy in the dynamic problem.

1. INTRODUCTION

The seminal paper Robbins (1952) introduced an important class of sequential
optimization problems, otherwise known as multi-armed bandits. These models
have since been used extensively in such fields as statistics, operations research,
engineering, computer science and economics. The traditional multi-armed
bandit problem can be described as follows. Consider $K \ge 2$ statistical populations
(arms), where at each point in time it is possible to sample from (pull) only one
of them and receive a random reward dictated by the properties of the sampled
population. The objective is to devise a sampling policy that maximizes expected
cumulative rewards over a finite time horizon. The difference between the
performance of a given sampling policy and that of an oracle that repeatedly samples
from the population with the highest mean reward is called the regret. Thus, one
can rephrase the objective as minimizing the regret.

When the populations being sampled are homogeneous, i.e., when the sequential
rewards are independent and identically distributed (iid) in each arm, the
family of upper-confidence-bound (UCB) policies, introduced in Lai and Robbins
(1985), incurs a regret of order log n, where n is the length of the time horizon,
and no other "good" policy can (asymptotically) achieve a smaller regret; see
also Auer et al. (2002). The elegance of the theory and sharp results developed in
Lai and Robbins (1985) hinge to a large extent on the assumption of homogeneous
populations and hence identically distributed rewards. This, however, is clearly
too restrictive for many applications of interest. Often, the decision maker
observes further information and, based on that, a more customized allocation can
be made. In such settings, rewards may still be assumed to be independent, but
no longer identically distributed in each arm. A particular way to encode this is
to allow for an exogenous variable (a covariate) that affects the rewards generated
by each arm at each point in time when this arm is pulled.

Such a formulation was first introduced in Woodroofe (1979) under parametric 
assumptions and in a somewhat restricted setting; see Goldenshluger and Zeevi 
(2009, 2010) and Wang et al. (2005) for very different recent approaches to the 
study of such bandit problems, as well as references therein for further links to 
antecedent literature. The first work to venture outside the realm of paramet- 
ric modeling assumptions appeared in Yang and Zhu (2002). In particular, the 
mean response in each arm, conditionally on the covariate value, was assumed to 
follow a general functional form, hence one can view their setting as a
nonparametric bandit problem. They propose a variant of the ε-greedy policy (see, e.g.,
Auer et al. (2002)) and show that the average regret tends to zero as the time
horizon n grows to infinity. However, it is unclear whether this policy satisfies a more
refined notion of optimality, insofar as the magnitude of the regret is concerned, 
as is the case for UCB-type policies in traditional bandit problems. Such ques- 
tions were partially addressed in Rigollet and Zeevi (2010) where near-optimal 
bounds on the regret are proved in the case of a two-armed bandit problem un- 
der only two assumptions on the underlying functional form that governs the 
arms' responses. The first is a mild smoothness condition and the second is a 
so-called margin condition that involves a margin parameter which encodes the 
"separation" between the functions that describe the arms' responses. 

The purpose of the present paper is to extend the setup of Rigollet and Zeevi
(2010) to the K-armed bandit problem with covariates when K may be large.
This involves a customized definition of the margin assumption. Moreover, the
bounds proved in Rigollet and Zeevi (2010) suffered from two deficiencies. First, they
hold only for a limited range of values of the margin parameter and second, the
upper bounds and the lower bounds mismatch by a logarithmic factor. Improving
upon these results requires radically new ideas. To that end, we introduce three 
policies: 

1. Successive Elimination (SE) is dedicated to the static bandit case. It is
the cornerstone of the other policies that deal with covariates. During a
first phase, this policy explores the different arms, builds estimates and
sequentially eliminates suboptimal arms; when only one arm remains, it is
pulled until the horizon is reached. A variant of SE was originally introduced
in Even-Dar et al. (2006). However, it was not tuned to minimize the regret,
as other measures of performance were investigated in that paper. We prove
new regret bounds for this policy that improve upon the canonical papers
Lai and Robbins (1985) and Auer et al. (2002).

2. Binned Successive Elimination (BSE) follows a simple principle to solve the
problem with covariates. It consists in grouping similar covariates into bins
and then looking only at the average reward over each bin. These bins are
viewed as indexing "local" bandit problems, solved by the aforementioned
SE policy. We prove optimal regret bounds, polynomial in the horizon, but
only for a restricted class of difficult problems. For the remaining class of
easy problems, the BSE policy is suboptimal.

3. Adaptively Binned Successive Elimination (ABSE) overcomes a severe
limitation of the naive BSE. Indeed, if the problem is globally easy (this is
characterized by the margin condition), the BSE policy employs a fixed
and too fine discretization of the covariate space. Instead, the ABSE
policy partitions the space of covariates in a fashion that adapts to the local
difficulty of the problem: cells are smaller when different arms are hard to
distinguish and bigger when one arm dominates the others. This adaptive
partitioning allows us to prove optimal regret bounds for the whole class
of problems.

The optimal polynomial regret bounds that we prove are much larger than the
logarithmic bounds proved in the static case. Nevertheless, it is important to keep in
mind that they are valid for a much more flexible model that incorporates
covariates. In the particular case where K = 2 and the problem is difficult, these bounds
improve upon the results of Rigollet and Zeevi (2010) by removing a logarithmic
factor that is idiosyncratic to the exploration vs. exploitation dilemma
encountered in bandit problems. Moreover, it follows immediately from the previous
minimax lower bounds of Audibert and Tsybakov (2007) and Rigollet and Zeevi
(2010) that these bounds are optimal in a minimax sense and thus cannot be
further improved. This reveals an interesting and somewhat surprising phenomenon:
the price to pay for the partial information in the bandit problem is dominated
by the price to pay for nonparametric estimation. Indeed, the bound on the regret
that we obtain in the bandit setup for K = 2 is of the same order as the best
attainable bound in the full information case, where at each round, the operator
receives the reward from only one arm but observes the rewards of both arms. An
important example of the full information case is sequential binary classification.

Our policies for the problem with covariates fall into the family of "plug-in"
policies as opposed to "minimum contrast" policies; a detailed account of the
differences and similarities between these two setups in the full information case
can be found in Audibert and Tsybakov (2007). Minimum contrast type policies
have already received some attention in the bandit literature with side
information, also known as contextual bandits, in the papers Langford and Zhang (2008) and
Kakade et al. (2008). A related problem, online convex optimization with side
information, was studied in Hazan and Megiddo (2007), where the authors use a
discretization technique similar to the one employed in this paper. It is worth
noting that the cumulative regret in these papers is defined in a weaker form
compared to the traditional bandit literature, since the cumulative reward of a
proposed policy is compared to that of the best policy in a certain restricted class
of policies. Therefore, bounds on the regret depend, among other things, on the
complexity of said class of policies. Plug-in type policies have received attention in
the context of the continuum armed bandit problem, where, as the name suggests,
there are uncountably many arms. Notable entries in that stream of work are
Lu et al. (2010) and Slivkins (2011), who impose a smoothness condition both on
the space of arms and the space of covariates, obtaining optimal regret bounds
up to logarithmic terms.

2. IMPROVED REGRET BOUNDS FOR THE STATIC PROBLEM 

In this section, it will be convenient, for notational purposes, to consider a
multi-armed bandit problem with K + 1 arms.

We revisit the Successive Elimination (SE) policy introduced in Even-Dar et al.
(2006) in the traditional setup of multi-armed bandit problems. As opposed to the
more popular UCB policy (see, e.g., Lai and Robbins (1985); Auer et al. (2002)),
it allows us, in the next section, to construct an adaptive partition that is crucial
to attain optimal rates on the regret for the dynamic case with covariates. In this
section, we prove refined regret bounds for the SE policy that exhibit a better
dependence on the expected rewards of the arms compared to the bounds for
UCB that were derived in Auer et al. (2002). Such an improvement was recently
attempted in Auer and Ortner (2010) and also in Audibert and Bubeck (2010)
for modified UCB policies, and we compare these results to ours below.

Let us recall the traditional setup for the static multi-armed bandit
problem (see, e.g., Auer et al. (2002)). Let $\mathcal{I} = \{1, \ldots, K+1\}$ be a given set of
$K + 1 \ge 2$ arms. Successive pulls of arm $i \in \mathcal{I}$ yield rewards
$Y_1^{(i)}, Y_2^{(i)}, \ldots$ that are iid random variables in [0, 1] with expectation
given by $\mathbb{E}[Y_t^{(i)}] = f^{(i)} \in [0, 1]$.
Assume without loss of generality that $f^{(1)} \le \cdots \le f^{(K+1)}$ so that $K + 1$ is
one of the best arms. For simplicity, we further assume that the best arm is
unique since for the SE policy, having multiple optimal arms only improves the
regret bound. In the analysis, it is convenient to denote this optimal arm by
$\star := K + 1$ and to define the gaps, traditionally denoted by
$\Delta_1 \ge \ldots \ge \Delta_\star = 0$, by $\Delta_i = f^{(\star)} - f^{(i)} \ge 0$.

A policy $\pi = \{\pi_t\}$ is a sequence of random variables $\pi_t \in \{1, \ldots, K+1\}$
indicating which arm to pull at each time $t = 1, \ldots, n$, and such that $\pi_t$ depends
only on observations strictly anterior to t.

The performance of a policy $\pi$ is measured by its (cumulative) regret at time
n defined by

    $R_n(\pi) := \sum_{t=1}^{n} \big( f^{(\star)} - f^{(\pi_t)} \big)$.

Note that for a data-driven policy $\hat\pi$, this quantity is random and, in the rest of
the paper, we provide upper bounds on $\mathbb{E}R_n(\hat\pi)$. Such bounds are referred to as
regret bounds.
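
As a small illustration of this definition, the following Python sketch evaluates the regret of a recorded sequence of pulls. The function name `cumulative_regret` and the zero-based arm indices are illustrative choices rather than notation from the paper, and the sketch assumes that the regret is the sum of the gaps of the pulled arms.

```python
def cumulative_regret(means, pulls):
    """Sum of the gaps f^(star) - f^(pi_t) over the recorded pulls.

    means: expected rewards f^(i), indexed by arm (zero-based, an assumption).
    pulls: sequence of arm indices pi_1, ..., pi_n.
    """
    best = max(means)  # expected reward of an optimal arm
    return sum(best - means[arm] for arm in pulls)


# Three arms with gaps 0.7, 0.4 and 0: only the first two pulls cost anything.
regret = cumulative_regret([0.2, 0.5, 0.9], [0, 1, 2, 2])
```

Only pulls of suboptimal arms contribute, which is why the analysis below counts how long each suboptimal arm can survive.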

We begin with a high-level description of the SE policy denoted by $\hat\pi$. It operates
in rounds that are different from the decision times $t = 1, \ldots, n$. At the beginning
of each round $\tau$, a subset of the arms has been eliminated and only a subset $\mathcal{I}_\tau$
remains. During round $\tau$, each arm in $\mathcal{I}_\tau$ is pulled exactly once (Exploration).
At the end of the round, for each remaining arm in $\mathcal{I}_\tau$, we decide whether to
eliminate it using a simple statistical hypothesis test: if we conclude that its mean
is significantly smaller than the mean of any remaining arm, then we eliminate
this arm and we keep it otherwise (Elimination). We repeat this procedure until
n pulls have been made. The number of rounds is random but obviously smaller
than n.

The SE policy, which is parameterized by two quantities $T \in \mathbb{N}$ and $\gamma > 0$
and described in Policy 1, outputs an infinite sequence of arms $\hat\pi_1, \hat\pi_2, \ldots$
without a prescribed horizon. Of course, it can be truncated at any horizon n. This
description emphasizes the fact that the policy can be implemented without
perfect knowledge of the horizon n and in particular, when the horizon is a random
variable with expected value n (see Corollaries 2.1 and 2.3 used in Sections 4
and 5). Nevertheless, in the static case, it is manifest from our result that, when
the horizon is known to be n, choosing T = n is always the best choice when
possible and that other choices may lead to suboptimal results.

Note that after the exploration phase of each round $\tau = 1, 2, \ldots$, each
remaining arm $i \in \mathcal{I}_\tau$ has been pulled exactly $\tau$ times, generating rewards
$Y_1^{(i)}, \ldots, Y_\tau^{(i)}$. Denote by $\bar Y^{(i)}(\tau)$ the average reward collected from arm
$i \in \mathcal{I}_\tau$ at round $\tau$ that is defined by
$\bar Y^{(i)}(\tau) = (1/\tau) \sum_{t=1}^{\tau} Y_t^{(i)}$, where here and throughout this paper,
we use the convention $1/0 = \infty$. For any positive integer T, define also

(2.1)    $U(\tau, T) = 2\sqrt{\frac{2\,\overline{\log}(T/\tau)}{\tau}}$,

which is essentially a high probability upper bound on the magnitude of deviations
of $\bar Y^{(j)}(\tau) - \bar Y^{(i)}(\tau)$ from its mean $f^{(j)} - f^{(i)}$.

The SE policy for a K-armed bandit problem can be implemented according
to the pseudo-code of Policy 1. Note that, to ease the presentation of Sections 4
and 5, the SE policy also returns at each time t the number of rounds $\hat\tau_t$ completed
at time t and a subset $\hat{\mathcal{S}}_t \in \mathcal{P}(\mathcal{I})$ of arms that are active at time t, where
$\mathcal{P}(\mathcal{I})$ denotes the power set of $\mathcal{I}$.

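Policy 1 Successive Elimination (SE)

Input: Set of arms $\mathcal{I} = \{1, \ldots, K\}$; Parameters $T$, $\gamma$; Horizon $n$
Output: $(\hat\pi_1, \hat\tau_1, \hat{\mathcal{S}}_1), (\hat\pi_2, \hat\tau_2, \hat{\mathcal{S}}_2), \ldots \in \mathcal{I} \times \mathbb{N} \times \mathcal{P}(\mathcal{I})$.

$\tau \leftarrow 1$, $\mathcal{S} \leftarrow \mathcal{I}$, $t \leftarrow 0$, $\bar Y \leftarrow (0, \ldots, 0) \in [0, 1]^K$

loop
    $\bar Y^{\max} \leftarrow \max\{\bar Y^{(i)} : i \in \mathcal{S}\}$
    for $i \in \mathcal{S}$ do
        if $\bar Y^{(i)} \ge \bar Y^{\max} - \gamma U(\tau, T)$ then
            $t \leftarrow t + 1$
            $\hat\pi_t \leftarrow i$ (observe $Y_t^{(i)}$).
            $\hat{\mathcal{S}}_t \leftarrow \mathcal{S}$, $\hat\tau_t \leftarrow \tau$
            $\bar Y^{(i)} \leftarrow \frac{1}{\tau}[(\tau - 1)\bar Y^{(i)} + Y_t^{(i)}]$.
        else
            $\mathcal{S} \leftarrow \mathcal{S} \setminus \{i\}$.
        end if
    end for
    $\tau \leftarrow \tau + 1$.
end loop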
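
The pseudo-code above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `log_bar`, `conf_radius` and `successive_elimination`, the zero-based arm indices and the callback `pull` are assumptions made for the example, and the confidence radius is assumed to be $U(\tau, T) = 2\sqrt{2\,\overline{\log}(T/\tau)/\tau}$. For simplicity the sketch performs all eliminations of a round before exploring, so the recorded active set is the one obtained after elimination.

```python
import math


def log_bar(x):
    """Truncated logarithm max(log(x), 1), for x > 0."""
    return max(math.log(x), 1.0)


def conf_radius(tau, T):
    """Assumed form of U(tau, T): 2 * sqrt(2 * log_bar(T / tau) / tau)."""
    return 2.0 * math.sqrt(2.0 * log_bar(T / tau) / tau)


def successive_elimination(pull, n_arms, T, gamma, horizon):
    """Run SE for `horizon` pulls; `pull(i)` returns a reward in [0, 1].

    Returns the list of triples (arm pulled, round, active set at that time).
    """
    active = list(range(n_arms))
    means = [0.0] * n_arms  # running averages, one per arm
    history = []
    t, tau = 0, 1
    while t < horizon:
        # Elimination test based on the averages of the previous rounds.
        y_max = max(means[i] for i in active)
        threshold = y_max - gamma * conf_radius(tau, T)
        active = [i for i in active if means[i] >= threshold]
        # Exploration: pull every surviving arm once, unless the horizon hits.
        for i in active:
            if t == horizon:
                break
            t += 1
            reward = pull(i)
            means[i] += (reward - means[i]) / tau  # average of tau rewards
            history.append((i, tau, tuple(active)))
        tau += 1
    return history


# Usage with noiseless rewards 0.1 and 0.9 (a deliberately simple toy case).
history = successive_elimination(lambda i: (0.1, 0.9)[i], 2, 2000, 1.0, 2000)
pulls_of_bad_arm = sum(1 for arm, _, _ in history if arm == 0)
```

In this toy run the suboptimal arm is dropped once the confidence radius falls below the gap of 0.8, after which only the better arm is pulled.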



The following theorem gives a first upper bound on the expected regret of
the SE policy. In the rest of the paper, log denotes the natural logarithm and
$\overline{\log}(x) = \log(x) \vee 1$.

Theorem 2.1. Consider a (K+1)-armed bandit problem. When implemented
with parameters $T = n$, $\gamma = 1$, the SE policy $\hat\pi$ exhibits an expected regret bounded
as

(2.2) Ei?„(7r) <min|334^i-bi(^^^ , logiK)^. 

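Proof. Define $\varepsilon_\tau = U(\tau, n)$. Moreover, for any i in the set $\mathcal{I}_\tau$ of arms that
remain active at the beginning of round $\tau$, define $\hat\Delta_i(\tau) := \bar Y^{(\star)}(\tau) - \bar Y^{(i)}(\tau)$.
Recall that, at round $\tau$, if arms $i, \star \in \mathcal{I}_\tau$, then (i) the optimal arm $\star$ eliminates
arm i if $\hat\Delta_i(\tau) > \varepsilon_\tau$ and (ii) arm i eliminates arm $\star$ if $\hat\Delta_i(\tau) < -\varepsilon_\tau$.

Since $\hat\Delta_i(\tau)$ estimates $\Delta_i$, the event in (i) happens approximately when
$\varepsilon_\tau \simeq \Delta_i$, so we introduce the deterministic, but unknown, quantity $\tau_i^*$ (and its
approximation $\tau_i = \lceil \tau_i^* \rceil$) defined as the solution of:

    $\Delta_i = \frac{3}{2}\varepsilon_{\tau_i^*} = 3\sqrt{\frac{2}{\tau_i^*}\overline{\log}\Big(\frac{n}{\tau_i^*}\Big)}$
    so that $\tau_i \le \tau_i^* + 1 \le \frac{18}{\Delta_i^2}\overline{\log}\Big(\frac{n\Delta_i^2}{18}\Big) + 1$,

since $\overline{\log}(x) = \max\{\log(x), 1\}$. Moreover, it holds that $1 \le \tau_1 \le \ldots \le \tau_K$.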
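
The bound on the elimination round can be checked directly. The short derivation below is a sketch under two assumptions that are inferred from the constants used in the proof rather than stated in a legible display: the confidence radius is $U(\tau, n) = 2\sqrt{2\,\overline{\log}(n/\tau)/\tau}$, and the defining equation for $\tau_i^*$ carries the factor 3/2.

```latex
\begin{align*}
\Delta_i = \tfrac{3}{2}\,U(\tau_i^*, n)
  &\iff \Delta_i^2 = \frac{18\,\overline{\log}(n/\tau_i^*)}{\tau_i^*}
   \iff \tau_i^* = \frac{18}{\Delta_i^2}\,\overline{\log}\Big(\frac{n}{\tau_i^*}\Big). \\
\overline{\log} \ge 1
  &\implies \tau_i^* \ge \frac{18}{\Delta_i^2}
   \implies \frac{n}{\tau_i^*} \le \frac{n\Delta_i^2}{18}. \\
\overline{\log} \text{ nondecreasing}
  &\implies \tau_i^* \le \frac{18}{\Delta_i^2}\,\overline{\log}\Big(\frac{n\Delta_i^2}{18}\Big),
   \qquad \tau_i = \lceil \tau_i^* \rceil \le \tau_i^* + 1 .
\end{align*}
```

Under the same assumptions, the ordering of the $\tau_i$ follows from the ordering of the gaps: a larger gap is detected in fewer rounds.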

We are going to decompose the regret accumulated by a suboptimal arm i into
three quantities:

- the regret accumulated by pulling this arm at most until round $\tau_i$: this
regret is smaller than $\tau_i \Delta_i$;

- the regret accumulated by eliminating the optimal arm $\star$ between round
$\tau_{i-1} + 1$ and $\tau_i$;

- the regret induced if arm i is still present at round $\tau_i$ (and in particular, if
it has not been eliminated by the optimal arm $\star$).

We prove that the second and third events happen with small probability,
because of the choice of $\tau_i$. Formally, define the following good events:

$A_i$ = {The arm $\star$ has not been eliminated before round $\tau_i$},

$B_i$ = {Every arm $j \in \{1, \ldots, i\}$ has been eliminated before round $\tau_i$}.

Moreover, define $C_i = A_i \cap B_i$ and observe that $C_1 \supseteq C_2 \supseteq \ldots$. For any
$i = 1, \ldots, K$, the contribution to the regret incurred after time $\tau_i$ on $C_i$ is at most
$n\Delta_{i+1}$ since each pull of arm $j \ge i + 1$ contributes to the regret by $\Delta_j \le \Delta_{i+1}$.
We decompose the underlying sample space denoted by $C_0$ into the disjoint union
$(C_0 \setminus C_1) \cup \cdots \cup (C_{K_0-1} \setminus C_{K_0}) \cup C_{K_0}$ where $K_0 \in \{1, \ldots, K\}$ is chosen later.
It implies the following decomposition of the expected regret:

(2.3)    $\mathbb{E}R_n(\hat\pi) \le \sum_{i=1}^{K_0} n\Delta_i \mathbb{P}(C_{i-1} \setminus C_i) + \sum_{i=1}^{K_0} \tau_i\Delta_i + n\Delta_{K_0+1}$.

Denote by $A^c$ the complement of an event A. Note that the first term on the
right-hand side of the above inequality can be bounded as follows

(2.4)    $n\sum_{i=1}^{K_0} \Delta_i \mathbb{P}(C_{i-1} \setminus C_i) \le n\sum_{i=1}^{K_0} \Delta_i \mathbb{P}(A_i^c \cap C_{i-1}) + n\sum_{i=1}^{K_0} \Delta_i \mathbb{P}(B_i^c \cap A_i \cap B_{i-1})$,

where the right-hand side was obtained using the decomposition
$C_i^c = A_i^c \cup (B_i^c \cap A_i)$ and the fact that $A_i \subseteq A_{i-1}$.

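On the event $A_i \cap B_{i-1}$, every suboptimal arm $j \le i - 1$ has been eliminated
before round $\tau_{i-1}$ and the optimal arm $\star$ is present at round $\tau_i$; so the probability
$\mathbb{P}(B_i^c \cap A_i \cap B_{i-1})$ is smaller than $\mathbb{P}(\hat\Delta_i(\tau_i) \le \varepsilon_{\tau_i})$. From Hoeffding's inequality,
we have that for every $\varepsilon \in (0, \Delta_i)$ and every $\tau \ge 1$:

(2.5)    $\mathbb{P}\big(\hat\Delta_i(\tau) \le \varepsilon\big) = \mathbb{P}\big(\hat\Delta_i(\tau) - \Delta_i \le \varepsilon - \Delta_i\big) \le \exp\Big(-\frac{\tau(\Delta_i - \varepsilon)^2}{2}\Big)$.

The choice of $\tau_i$ implies that $\Delta_i \ge \frac{3}{2}\varepsilon_{\tau_i}$, so that

(2.6)    $\mathbb{P}(B_i^c \cap A_i \cap B_{i-1}) \le \mathbb{P}\big(\hat\Delta_i(\tau_i) \le \varepsilon_{\tau_i}\big) \le \exp\Big(-\frac{\tau_i\varepsilon_{\tau_i}^2}{8}\Big) \le \frac{\tau_i}{n}$.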

It remains to bound the first term in the right-hand side of (2.4). On the event $C_{i-1}$, the
optimal arm $\star$ has not been eliminated before the round $\tau_{i-1}$ but every suboptimal
arm $j \le i - 1$ has. So the probability that there exists an arm $j \ge i$ that eliminates
$\star$ between $\tau_{i-1}$ and $\tau_i$ can be bounded as

P(AnCi_i) < P(3(j», i< j<i^,ri_i + l<s<r,; Aj(s)<-e,) 

K 



< ^P(3 s,ri_i + 1 < s < Ti; Aj(s) < 
j=i 

K 



P(3s < r; Aj(s) < -e, ) < 4 



^^0 

5^AiP(Anc 



where, using Lemma A.l, we get ^jir) := 
Moreover, the above inequality implies that 

Ko K 

4=1 j=i 
K jAKo-1 K 

j=l i=l j=l 

j=l i=l j=l 

Using the facts that $\tau_i \le \frac{18}{\Delta_i^2}\overline{\log}\big(\frac{n\Delta_i^2}{18}\big) + 1$ and $\Delta_{i+1} \le \Delta_i$, the first sum can be
bounded as:

K jAKo-l 



^,(a.-a.,,) < K+isY E M"^) 



nAA Ai- Ai+i 



i=i i=i 



j=i i=i 

K Ai 



A, 



7 = 1 •''^jAA'o V / 



^ 1 

^ A.A 



log 



18 



+ 2
The previous two displays yield

Putting together (2.3), (2.4), (2.6) and the above display yields 

Choosing now the value $K_0 = K$ completes the proof of the first bound.
On the other hand, the specific value 



A'o+l 



j. log(nAf/18) l n\og{K) \ 
i^o = max|^<i^; ^| 

implies that /S.Ko+i < 2y^ K log{K)/n which gives the second bound. | 

The right-hand side of (2.2) is the minimum of two terms. The first term
is distribution-dependent and shows that the SE policy adapts to the unknown
distribution of the rewards. It is very much in the spirit of the original bound of
Lai and Robbins (1985) and of the more recent finite sample result of Auer et al.
(2002). Our bound for the SE policy is smaller than the aforementioned bounds for
the UCB policy by a logarithmic factor. Lai and Robbins (1985) did not provide
the first bounds on the expected regret. Indeed, Vogel (1960) and Bather (1981)
had previously derived what are often called gap-free bounds, as they hold uniformly
over the $\Delta_i$'s. The second term in our bound is such a gap-free bound. It is of
secondary interest in this paper and arises as a byproduct of the refined
distribution-dependent bound. Nevertheless, it allows us to recover near optimal bounds of
the same order as Juditsky et al. (2008). They depart from optimal rates by
a factor $\sqrt{\log K}$, as proved in Audibert and Bubeck (2010). Actually, the result
of Audibert and Bubeck (2010) is much stronger than our gap-free bounds since it
holds for any sequence of bounded rewards, not necessarily drawn independently.

Neither the distribution-dependent bound in Theorem 2.1 nor the one provided
in Audibert and Bubeck (2010) is stronger than the other. The superiority of one
over the other depends on the set $\{\Delta_1, \ldots, \Delta_K\}$: in some cases (for example if all
suboptimal arms have the same expectation) the latter is the best while in other
cases (if the $\Delta_i$ are spread) our bounds are better.

The next corollaries follow from slight variations on the proof of Theorem 2.1. 
We only sketch their proof for brevity. 

Corollary 2.1. When implemented with any parameter $T$ and $\gamma = 1$ and
run at horizon n, the SE policy $\hat\pi$ exhibits an expected regret bounded as

    $\mathbb{E}R_n(\hat\pi) \le 334 \max\Big\{1, \frac{n}{T}\Big\} \sum_{i=1}^{K} \frac{1}{\Delta_i}\overline{\log}\Big(\frac{T\Delta_i^2}{18}\Big)$.

Thus if the horizon is a random variable N of expectation n that is independent
of the random rewards, the SE policy $\hat\pi$ implemented with T exhibits an expected
regret bounded as




Proof. In this setting, the regret induced by eliminating $\star$ (or not eliminating
a suboptimal arm i before round $\tau_i$) is still bounded above by $n\Delta_i$. However, one
has to substitute T for n in the probability of making such mistakes. |



The following corollaries are used in Sections 4 and 5. 

Corollary 2.2. Let $K_0$ be an integer between 1 and K. When implemented
with parameters $T = n$, $\gamma = 1$, the SE policy $\hat\pi$ exhibits an expected regret bounded
as



1 = 1 * ^ 



JERnirc) < 334 )^ ^ log ( j + 296^^^-^ log I \ + nA^o+i 



Proof. It was proved in (2.7). | 

This corollary is actually closer to the result of Auer and Ortner (2010). The
additional second term in our bound comes from the fact that we had to take into
account the probability that an optimal arm $\star$ can be eliminated by any arm, not
just by some suboptimal arm with index lower than $K_0$ (see Auer and Ortner
(2010), page 8). It is unclear why it is enough to look at the elimination by
those arms, since if $\star$ is eliminated, no matter which arm eliminated it, the
Hoeffding bound (2.5) no longer holds.

Corollary 2.3. If the horizon is a random variable N of expectation n
that is independent of the random rewards, the SE policy $\hat\pi$ implemented with
parameters $T$, $\gamma \ge 1$ exhibits an expected regret bounded as

nRNiTT)] < 33472 (1 + 5) -^bi ( +n^^A] 



T) A^/"^\^ I872 y "^0' 
for any Kq G { 1 , . . . , K} and where A]^^ is the largest Aj such that Aj < Akq ■ 



Proof. If C/'(r,r) = 7[/(T,r) = l-f^J^^^^), then A^ = \ JJ'(tI,T) implies 
that t[ := \t*] < ^ log ( jg;^ j +1. With a larger confidence bound e'^ := U{t, n), 
Equation (2.6) becomes 

W{Bf n A n Bi-i) < JP{Ai{n) < e'^,) < exp 



V 



n J n 



Similarly, 

f(3s < r'; Aj{s) < -e',) = p(3s < r'; A,(s) < -76,) < $,(t') < 4^,
thus the proof of Theorem 2.1 holds except that the upper bound of $\tau_i$ must be
replaced by the one of $\tau_i'$. |

3. BANDIT WITH COVARIATES 

This section is dedicated to a detailed description of the nonparametric bandit
problem with covariates.

3.1 Machine and game 

A K-armed bandit machine with covariates (with K an integer greater than
2) is characterized by a sequence

    $(X_t, Y_t^{(1)}, \ldots, Y_t^{(K)}), \quad t = 1, 2, \ldots$

of independent random vectors, where $(X_t)_{t \ge 1}$ is a sequence of iid covariates
in $\mathcal{X} = [0, 1]^d$ with probability distribution $P_X$, and $Y_t^{(i)}$ denotes the random
reward yielded by arm i at time t. Throughout the paper, we assume that $P_X$
has a density, with respect to the Lebesgue measure, bounded above and below by
some $\bar c > 0$ and $\underline{c} > 0$ respectively. We denote by $\mathbb{E}_X$ the expectation with respect
to $P_X$. We assume that, for each $i \in \mathcal{I} = \{1, \ldots, K\}$, rewards $Y_t^{(i)}$, $t = 1, \ldots, n$
are random variables in [0, 1] with conditional expectation given by

    $\mathbb{E}[Y_t^{(i)} \mid X_t] = f^{(i)}(X_t), \quad i = 1, \ldots, K, \quad t = 1, 2, \ldots$

where $f^{(i)}$, $i = 1, \ldots, K$, are unknown functions such that $0 \le f^{(i)}(x) \le 1$, for
any $i = 1, \ldots, K$, $x \in \mathcal{X}$. A natural example is where $Y_t^{(i)}$ takes values in {0, 1}
so that the conditional distribution of $Y_t^{(i)}$ given $X_t$ is Bernoulli with parameter
$f^{(i)}(X_t)$.
The game takes place sequentially on this machine, pulling one of the arms
at each time $t = 1, \ldots, n$. A policy $\pi = \{\pi_t\}$ is a sequence of random functions
$\pi_t : \mathcal{X} \to \{1, \ldots, K\}$ indicating to the operator which arm to pull at each time t,
and such that $\pi_t$ depends only on observations strictly anterior to t. The oracle
policy $\pi^\star$ refers to the strategy that would be run by an omniscient operator with
complete knowledge of the functions $f^{(i)}$, $i = 1, \ldots, K$. Given side information $X_t$,
the oracle policy $\pi^\star$ prescribes to pull any arm with the largest expected reward,
i.e.,

    $\pi^\star(X_t) \in \operatorname{argmax}_{i = 1, \ldots, K} f^{(i)}(X_t)$,

with ties broken arbitrarily. Note that the function $f^{(\pi^\star(x))}(x)$ is equal to the
pointwise maximum of the functions $f^{(i)}$, $i = 1, \ldots, K$ defined by

    $f^\star(x) = \max\{f^{(i)}(x);\ i = 1, \ldots, K\}$.

The oracle rule is used to benchmark any proposed policy $\pi$ and to measure the
performance of the latter via its (cumulative) regret at time n defined by

    $R_n(\pi) := \sum_{t=1}^{n} \big( f^{(\pi^\star(X_t))}(X_t) - f^{(\pi_t(X_t))}(X_t) \big) = \sum_{t=1}^{n} \big( f^\star(X_t) - f^{(\pi_t(X_t))}(X_t) \big)$.

Without further assumptions on the machine, the game can be arbitrarily
difficult and, as a result, expected regret can be arbitrarily close to n. In the
following subsection, we describe natural regularity conditions under which it is
possible to achieve sublinear growth rate of the expected regret, and characterize
policies that perform in a near-optimal manner.

3.2 Smoothness and margin conditions 

As usual in nonparametric estimation, we first impose some regularity on the
functions $f^{(i)}$, $i = 1, \ldots, K$. Here and in what follows we use $\|\cdot\|$ to denote the
Euclidean norm on $\mathbb{R}^d$.

Smoothness condition. We say that the machine satisfies the smoothness
condition with parameters $(\beta, L)$ if $f^{(i)}$ is $(\beta, L)$-Hölder, i.e., if

    $|f^{(i)}(x) - f^{(i)}(x')| \le L\|x - x'\|^\beta, \quad \forall x, x' \in \mathcal{X},\ i = 1, \ldots, K$

for some $\beta \in (0, 1]$ and $L > 0$.

Now denote the second pointwise maximum of the functions $f^{(i)}$, $i = 1, \ldots, K$
by $f^\sharp$; formally for every $x \in \mathcal{X}$ such that $\min_i f^{(i)}(x) \ne \max_i f^{(i)}(x)$ it is defined
by:

    $f^\sharp(x) = \max_i \{ f^{(i)}(x);\ f^{(i)}(x) < f^\star(x) \}$

and by $f^\sharp(x) = f^\star(x) = f^{(1)}(x)$ otherwise. Notice that a direct consequence
of the smoothness condition is that the function $f^\star$ is $(\beta, L)$-Hölder; however,
$f^\sharp$ might not even be continuous. The behavior of the function $\Delta := f^\star - f^\sharp$
critically controls the complexity of the problem and the Hölder regularity gives
a local upper bound on this quantity. The second condition gives a lower bound on
this function though in a weaker global sense. It is closely related to the margin
condition employed in classification (Tsybakov (2004); Mammen and Tsybakov
(1999)), which drives the terminology employed here. It was originally imported
to the bandit setup by Goldenshluger and Zeevi (2009).

Margin condition. We say that the machine satisfies the margin condition
with parameter $\alpha > 0$ if there exist $\delta_0 \in (0, 1)$, $C_0 > 0$ such that

    $P_X\big[0 < f^\star(X) - f^\sharp(X) \le \delta\big] \le C_0\delta^\alpha, \quad \forall \delta \in [0, \delta_0]$.

If the marginal $P_X$ has a density bounded above and below, the margin
condition contains only information about the behavior of the function $\Delta$ and not
the marginal $P_X$ itself. This is in contrast with Goldenshluger and Zeevi (2009)
where the margin assumption is used precisely to control the behavior of the
marginal $P_X$ while that of the reward functions is fixed. A large value of the
parameter $\alpha$ means that the function $\Delta$ either takes value 0 or is bounded away
from 0, except over a set of small $P_X$-probability. Conversely, for values of $\alpha$ close
to 0, the margin condition is essentially void and the two functions can be
arbitrarily close, making it difficult to distinguish them. This is reflected in the bounds on
the expected regret derived in the subsequent section.
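
To make the quantity controlled by the margin condition concrete, the following Python sketch computes the gap between the best and second-best reward at a point and estimates the probability of a small positive gap. It is only an illustration: the helper names `gap` and `margin_mass` are hypothetical, and the covariate distribution is approximated by equal weights on a user-supplied list of points.

```python
def gap(values):
    """f*(x) - f#(x) computed from the list [f^(1)(x), ..., f^(K)(x)].

    f# is the largest value strictly below the maximum; when every arm ties,
    f# = f* by convention and the gap is zero.
    """
    best = max(values)
    below = [v for v in values if v < best]
    return best - max(below) if below else 0.0


def margin_mass(reward_fns, delta, points):
    """Fraction of `points` x with 0 < f*(x) - f#(x) <= delta."""
    hits = sum(
        1 for x in points if 0.0 < gap([f(x) for f in reward_fns]) <= delta
    )
    return hits / len(points)


# Toy check on [0, 1]: f1(x) = x and f2(x) = 1/2, so the gap is |x - 1/2|.
grid = [(i + 0.5) / 1000 for i in range(1000)]
mass = margin_mass([lambda x: x, lambda x: 0.5], 0.1, grid)
```

In this toy case the estimated mass grows linearly in `delta`, which corresponds to a margin parameter equal to 1 under a uniform covariate distribution.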

Intuitively, the smoothness condition and the margin condition work in
opposite directions. Indeed, the former ensures that the function $\Delta$ does not "depart
from zero" too fast whereas the latter warrants the opposite. The following
proposition quantifies the extent of this conflict.

Proposition 3.1. Under the smoothness condition with parameters $(\beta, L)$,
and the margin condition with parameter $\alpha$, the following holds

- If $\alpha\beta > d$ then a given arm is either always or never optimal; in the latter
case, it is bounded away from $f^\star$ and one can take $\alpha = \infty$;

- If $\alpha\beta \le d$ then there exist machines with nontrivial oracle policies.

Proof. This proposition is a straightforward consequence of, respectively, the
first two points of Proposition 3.4 in Audibert and Tsybakov (2005).

For completeness, we provide an example with $d = 1$, $\mathcal{X} = [0, 1]$,
$f^{(2)} = \ldots = f^{(K)} \equiv 0$ and $f^{(1)}(x) = L\,\mathrm{sign}(x - .5)|x - .5|^{1/\alpha}$. Notice that $f^{(1)}$ is $(\beta, L)$-Hölder
if and only if $\alpha\beta \le 1$. Any oracle policy is non-trivial, and, for example, one can
define $\pi^\star(x) = 2$ if $x < .5$ and $\pi^\star(x) = 1$ if $x \ge .5$. Moreover, it can be easily
shown that the machine satisfies the margin condition with parameter $\alpha$ and with
$\delta_0 = C_0 = 1$. |

We denote by $\mathcal{M}_{\mathcal{X}}^K(\alpha, \beta, L)$ the class of K-armed bandit problems with
covariates in $\mathcal{X} = [0, 1]^d$ with a machine satisfying the margin condition with parameter
$\alpha > 0$, the smoothness condition with parameters $(\beta, L)$ and such that $P_X$ has
a density, with respect to the Lebesgue measure, bounded above and below by
some $\bar c > 0$ and $\underline{c} > 0$ respectively.

3.3 Binning of the covariate space 

To design a policy that solves the bandit problem with covariates described
above, one inevitably has to find an estimate of the functions $f^{(i)}$, $i = 1, \ldots, K$
at the current point $X_t$. There exists a wide variety of nonparametric regression
estimators ranging from local polynomials to wavelet estimators. Both of the
policies introduced below are based on estimators of $f^{(i)}$, $i = 1, \ldots, K$ that are
$P_X$ almost surely piecewise constant over a particular collection of subsets, called
bins, of the covariate space $\mathcal{X}$.

We define a partition of $\mathcal{X}$ in a measure theoretic sense as a collection of
measurable sets, hereafter called bins, $B_1, B_2, \ldots$ such that $P_X(B_j) > 0$, $\bigcup_{j \ge 1} B_j = \mathcal{X}$
and $B_j \cap B_k = \emptyset$ for $j \ne k$, up to sets of null $P_X$ probability. For any reward
function $f^{(i)}$ on $\mathcal{X}$, the average reward on bin B is defined by

(3.8)    $\bar f_B^{(i)} = \mathbb{E}[f^{(i)}(X_1) \mid X_1 \in B] = \frac{1}{p_B}\int_B f^{(i)}(x)\,dP_X(x)$,

where $p_B = P_X(B)$.

To define and analyze our policies, it is convenient to reindex our observations $(X_t, Y_t^{(1)}, \ldots, Y_t^{(K)})_{t \ge 1}$ as follows. Given a bin $B$, let $t_B(s)$ denote the $s$th time at which the sequence $(X_t)_{t \ge 1}$ is in $B$ and observe that it is a stopping time. Moreover, define the sequence of successive rewards obtained from arm $i$ at times at which $X_t \in B$ by $Y_{B,s}^{(i)} = Y_{t_B(s)}^{(i)}$, $s \ge 1$. It is a standard exercise to show that the rewards $Y_{B,1}^{(i)}, Y_{B,2}^{(i)}, \ldots$ are i.i.d. random variables in $[0,1]$ with expectation given by $\bar f_B^{(i)} \in [0,1]$. Therefore, if we restrict our attention to observations in a given bin $B$, we are in the same setup as the static bandit problem studied in the previous section. This observation leads to the notion of policy on $B$. More precisely, fix a subset $B \subset \mathcal X$, an integer $t_0 \ge 1$ and recall that $\{t_B(s) : s \ge 1, t_B(s) \ge t_0\}$ is the set of chronological times $t$ posterior to $t_0$ at which $X_t \in B$. Fix $\mathcal I' \subset \mathcal I$ and consider the static bandit problem with arms $\mathcal I'$ defined in Section 2 where successive pulls of arm $i \in \mathcal I'$ yield rewards $Y_{B,1}^{(i)}, Y_{B,2}^{(i)}, \ldots$, where $Y_{B,s}^{(i)} = Y_{t_B(s)}^{(i)}$, $t_B(s) \ge t_0$. The SE policy with parameters $T$, $\gamma$ on this static problem is called SE policy on $B$ initialized at time $t_0$ with initial set of arms $\mathcal I'$
and parameters $T$, $\gamma$.
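The reindexing above can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: covariates are scalars, the bin is described by a membership predicate, and the helper names `visit_times` and `reindex_rewards` are hypothetical. Times are 1-indexed, as in the text.

```python
from typing import Callable, List, Sequence


def visit_times(xs: Sequence[float], in_bin: Callable[[float], bool],
                t0: int = 1) -> List[int]:
    """Chronological times t >= t0 (1-indexed) at which X_t falls in the bin.

    With t0 = 1, the s-th element of the result plays the role of t_B(s).
    """
    return [t for t, x in enumerate(xs, start=1) if t >= t0 and in_bin(x)]


def reindex_rewards(ys: Sequence[float], times: Sequence[int]) -> List[float]:
    """Rewards of one arm restricted to the visit times, i.e. Y_{B,s} = Y_{t_B(s)}."""
    return [ys[t - 1] for t in times]


if __name__ == "__main__":
    xs = [0.1, 0.7, 0.3, 0.9, 0.45]
    ys = [1.0, 0.0, 0.5, 0.2, 0.8]
    times = visit_times(xs, lambda x: x < 0.5)
    print(times, reindex_rewards(ys, times))
```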

4. BINNED SUCCESSIVE ELIMINATION 

We first outline a naive policy to operate the bandit machine described in
Section 3. It consists in fixing a partition of $\mathcal X$ and, for each set $B$ in this partition,
running the SE policy on $B$ initialized at time $t_0 = 1$ with initial set of arms $\mathcal I$ and
parameters to be defined below.

The Binned Successive Elimination (bse) policy $\hat\pi$ relies on a specific partition of $\mathcal X$. Let $\mathcal B_M := \{B_1, \ldots, B_{M^d}\}$ be the regular partition of $\mathcal X = [0,1]^d$, i.e., the reindexed collection of hypercubes defined for $k = (k_1, \ldots, k_d) \in \{1, \ldots, M\}^d$ by

$B_k = \left\{ x \in \mathcal X : \frac{k_\ell - 1}{M} \le x_\ell \le \frac{k_\ell}{M},\ \ell = 1, \ldots, d \right\}$.

In this paper, all sets are defined up to sets of null Lebesgue measure. As mentioned in subsection 3.3, the problem can be decomposed into $M^d$ independent
static bandit problems, one for each $B \in \mathcal B_M$.
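To make the regular partition concrete, here is a minimal sketch that maps a covariate to the multi-index $k$ of its hypercube. Since bins are only defined up to null sets, the treatment of boundary points is a free choice; the sketch assumes half-open cells and sends the boundary value 1 to the last cell. The name `bin_index` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def bin_index(x: Sequence[float], M: int) -> Tuple[int, ...]:
    """Multi-index k in {1, ..., M}^d of the hypercube of side 1/M containing x.

    Assumption: cells are treated as half-open, [(k-1)/M, k/M), and the
    coordinate value 1.0 is assigned to the last cell.
    """
    if any(c < 0.0 or c > 1.0 for c in x):
        raise ValueError("x must lie in [0, 1]^d")
    return tuple(min(M, math.floor(c * M) + 1) for c in x)


if __name__ == "__main__":
    print(bin_index((0.0, 0.99), 4))
```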

Denote by $\hat\pi_B$ the SE policy on bin $B$ with initial set of arms $\mathcal I$ and parameters
$T = nM^{-d}$, $\gamma = 1$. For any $x \in \mathcal X$, let $B(x) \in \mathcal B_M$ denote the bin such that
$x \in B(x)$. The BSE policy $\hat\pi$ is a sequence of functions $\hat\pi_t : \mathcal X \to \mathcal I$ defined
by $\hat\pi_t(x) = \hat\pi_{B, t_B(t)}$, where $B = B(x)$. It can be implemented according to the
pseudo-code of Policy 2.

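Policy 2 Binned Successive Elimination (bse)

Input: Set of arms $\mathcal I = \{1, \ldots, K\}$. Parameters $n$, $M$.
Output: $\hat\pi_1, \ldots, \hat\pi_n \in \mathcal I$.
$\mathcal B \leftarrow \mathcal B_M$
for $B \in \mathcal B_M$ do
    Initialize a SE policy $\hat\pi_B$ with parameters $T = nM^{-d}$, $\gamma = 1$.
    $t_B \leftarrow 0$.
end for
for $t = 1, \ldots, n$ do
    $B \leftarrow B(X_t)$.
    $t_B \leftarrow t_B + 1$.
    $\hat\pi_t \leftarrow \hat\pi_{B, t_B}$ (observe $Y_t^{(\hat\pi_t)}$).
end for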
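The pseudo-code of Policy 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation. The confidence radius `U` is an assumed stand-in for the quantity defined in (2.1), which lies outside this section, and the elimination rule (drop an arm when its empirical mean trails the best one by more than $\gamma U$) is likewise assumed from the description of SE in Section 2. The class and function names, and the callbacks `draw_x` and `draw_y`, are hypothetical. Policies are created lazily, the first time a bin is visited, which is equivalent to the initialization loop.

```python
import math
import random


def U(tau, T):
    # ASSUMED form of the confidence radius (2.1):
    # 2 * sqrt((2 / tau) * max(log(T / tau), 1)).
    return 2.0 * math.sqrt(2.0 / tau * max(math.log(T / tau), 1.0))


class SuccessiveElimination:
    """Round-robin over active arms; eliminate at the end of each round."""

    def __init__(self, arms, T, gamma):
        self.active = list(arms)
        self.T, self.gamma = T, gamma
        self.sums = {i: 0.0 for i in self.active}
        self.tau = 0  # number of completed rounds
        self.pos = 0  # position inside the current round

    def select(self):
        return self.active[self.pos]

    def update(self, arm, reward):
        self.sums[arm] += reward
        self.pos += 1
        if self.pos < len(self.active):
            return
        # A full round is complete: every active arm has tau pulls.
        self.pos, self.tau = 0, self.tau + 1
        means = {i: self.sums[i] / self.tau for i in self.active}
        best = max(means.values())
        radius = self.gamma * U(self.tau, self.T)
        self.active = [i for i in self.active if best - means[i] <= radius]


def bin_index(x, M):
    # Half-open cells; the value 1.0 goes to the last cell (a convention).
    return tuple(min(M, math.floor(c * M) + 1) for c in x)


def run_bse(n, M, K, draw_x, draw_y, seed=0):
    """Run BSE for n steps; returns the total reward and the per-bin policies."""
    rng = random.Random(seed)
    policies = {}
    total = 0.0
    for _ in range(n):
        x = draw_x(rng)
        B = bin_index(x, M)
        if B not in policies:  # lazy version of the initialization loop
            T = n * M ** (-len(x))
            policies[B] = SuccessiveElimination(range(K), T, 1.0)
        arm = policies[B].select()
        reward = draw_y(rng, arm, x)
        policies[B].update(arm, reward)
        total += reward
    return total, policies
```

A toy usage with two arms whose optimal regions are the two halves of $[0,1]$: `run_bse(2000, 2, 2, lambda rng: (rng.random(),), lambda rng, arm, x: 1.0 if (x[0] < 0.5) == (arm == 0) else 0.0)`.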



The following theorem gives an upper bound on the expected regret of the
BSE policy in the case where the problem is difficult, that is, when the margin
parameter $\alpha$ satisfies $0 < \alpha < 1$.

Theorem 4.1. Fix $\beta \in (0,1]$, $L > 0$ and $\alpha \in (0,1)$ and consider a problem in $\mathcal M^K_{\mathcal X}(\alpha, \beta, L)$. Then the bse policy $\hat\pi$ with $M = \left\lfloor \left(\frac{n}{K\log(K)}\right)^{\frac{1}{2\beta+d}} \right\rfloor$ has an expected regret at time $n$ bounded as follows,

$\mathbb E R_n(\hat\pi) \le C n \left(\frac{K \log K}{n}\right)^{\frac{\beta(\alpha+1)}{2\beta+d}}$,

where $C > 0$ is a positive constant that does not depend on $K$.

The case $K = 2$ was studied in Rigollet and Zeevi (2010) using a similar policy
called UCBogram. The bound in Theorem 4.1 improves upon the rate for the
UCBogram by a logarithmic term in $n$. This comes from the fact that, unlike
in Rigollet and Zeevi (2010) where suboptimal bounds for the UCB policy are
used, we use here the sharper regret bounds of Corollary 2.3 and the SE policy
as a workhorse for our policy. Optimality of this bound in the case of $K = 2$
arms is discussed after Theorem 5.1.
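For orientation, the number of bins prescribed by Theorem 4.1 and the resulting rate (without the unspecified constant $C$) can be evaluated numerically. The sketch below is illustrative only; the function names are hypothetical, and the guard `max(1, ...)` is an added assumption so that at least one bin is used when $n$ is small.

```python
import math


def bse_num_bins(n: int, K: int, beta: float, d: int) -> int:
    """M = floor((n / (K log K))^(1 / (2 beta + d))), guarded to be at least 1."""
    return max(1, math.floor((n / (K * math.log(K))) ** (1.0 / (2 * beta + d))))


def bse_rate(n: int, K: int, alpha: float, beta: float, d: int) -> float:
    """n * (K log K / n)^(beta (alpha + 1) / (2 beta + d)), i.e. the bound without C."""
    exponent = beta * (alpha + 1) / (2 * beta + d)
    return n * (K * math.log(K) / n) ** exponent


if __name__ == "__main__":
    print(bse_num_bins(10**6, 2, 1.0, 1), bse_rate(10**6, 2, 0.5, 1.0, 1))
```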

Proof. We assume that $\mathcal B_M = \{B_1, \ldots, B_{M^d}\}$, where the indexing will be made clearer later in the proof. Moreover, to keep track of positive constants, we number them $c_1, c_2, \ldots$. For any real valued function $f$ on $\mathcal X$ and any measurable $A \subseteq \mathbb R$, we use the notation $P_X(f \in A) = P_X(f(X) \in A)$. Moreover, for any $i \in \{1, \ldots, K\}$, we use the notation $\bar f_j^{(i)} = \bar f_{B_j}^{(i)}$.

Define $c_1 = 2Ld^{\beta/2} + 1$, and let $n_0 \ge 2$ be the largest integer such that $n_0^{\beta/(2\beta+d)} \le 2c_1/\delta_0$, where $\delta_0$ is the constant appearing in the margin condition. If $n \le n_0$, we have $R_n(\hat\pi) \le n_0$ so that the result of the theorem holds when $C$ is chosen large enough, depending on the constant $n_0$. In the rest of the proof, we assume that $n > n_0$ so that $c_1 M^{-\beta} < \delta_0$.

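Recall that the BSE policy $\hat\pi$ is a collection of functions $\hat\pi_t = \hat\pi_{B, t_B(t)}$ that are constant on each $B_j$. Therefore, the regret of $\hat\pi$ can be decomposed as $R_n(\hat\pi) = \sum_{j=1}^{M^d} R_j(\hat\pi)$, where

$R_j(\hat\pi) = \sum_{t=1}^{n} \left( f^\star(X_t) - f^{(\hat\pi_t(X_t))}(X_t) \right) \mathbb 1(X_t \in B_j)$.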

Consider the set of well behaved bins on which the expected reward functions of the arms are well separated. These are the bins $B_j$ with indices in $\mathcal J$ defined by

$\mathcal J := \{ j \in \{1, \ldots, M^d\} \text{ s.t. } \exists\, x \in B_j,\ f^\star(x) - f^\sharp(x) > c_1 M^{-\beta} \}$.

A bin $B$ that is not well behaved is called strongly ill behaved if there is some $x \in B$ such that $f^\star(x) = f^\sharp(x) = f^{(i)}(x)$, for all $i \in \mathcal I$, and weakly ill behaved otherwise. Respectively, the sets of strongly or weakly ill behaved bins have indices in

$\mathcal J_s := \{ j \in \{1, \ldots, M^d\} \text{ s.t. } \exists\, x \in B_j,\ f^\star(x) = f^\sharp(x) \}$ and

$\mathcal J_w := \{ j \in \{1, \ldots, M^d\} \text{ s.t. } \forall\, x \in B_j,\ 0 < f^\star(x) - f^\sharp(x) \le c_1 M^{-\beta} \}$.

Note that for any $i \in \mathcal I$, the function $f^\star - f^{(i)}$ is $(\beta, 2L)$-Hölder. It implies that for any $j \in \mathcal J_s$ and any $i = 1, \ldots, K$, we have $f^\star(x) - f^{(i)}(x) \le c_1 M^{-\beta}$ for all $x \in B_j$, so that the inclusion $\mathcal J_s \subset \{1, \ldots, M^d\} \setminus \mathcal J$ indeed holds.

First part: Strongly ill behaved bins in $\mathcal J_s$.

Recall that for any $j \in \mathcal J_s$, any arm $i \in \mathcal I$, and any $x \in B_j$, $f^\star(x) - f^{(i)}(x) \le c_1 M^{-\beta}$. Therefore,

$\sum_{j \in \mathcal J_s} \mathbb E R_j(\hat\pi) \le c_1 n M^{-\beta} P_X\{ 0 < f^\star(X) - f^\sharp(X) \le c_1 M^{-\beta} \}$
(4.9)   $\le C_\delta c_1^{1+\alpha} n M^{-\beta(1+\alpha)}$,

where we used the fact that the set $\{x \in \mathcal X : f^\star(x) = f^\sharp(x)\}$ does not contribute
to the regret.

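Second part: Weakly ill behaved bins in $\mathcal J_w$.

The number of weakly ill behaved bins can be bounded using the fact that $f^\star(x) - f^\sharp(x) > 0$ on such a bin; indeed, the margin condition implies that

$\frac{\underline c\, |\mathcal J_w|}{M^d} \le P_X\{ 0 < f^\star(X) - f^\sharp(X) \le c_1 M^{-\beta} \} \le C_\delta c_1^{\alpha} M^{-\beta\alpha}$.

It yields $|\mathcal J_w| \le \frac{C_\delta c_1^{\alpha}}{\underline c} M^{d - \beta\alpha}$.

We bound the expected regret on each weakly ill behaved bin using Corollary 2.3 and the specific values $\Delta_{K_0} := \sqrt{K \log(K)/T}$, $\gamma = 1$ and $T = nM^{-d}$. It implies that there exists a positive constant $c_2 > 0$ such that:

(4.10)   $\sum_{j \in \mathcal J_w} \mathbb E R_j(\hat\pi) \le |\mathcal J_w| \sup_{j \in \mathcal J_w} \mathbb E R_j(\hat\pi) \le c_2 \sqrt{K \log(K)}\, M^{\frac{d}{2} - \beta\alpha} \sqrt{n}$.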

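Third part: Well behaved bins in $\mathcal J$.

This part is decomposed into two steps. In the first step, we bound the expected regret in a given bin $B_j$, $j \in \mathcal J$; in the second step we use the margin condition to control the sum of all these expected regrets.

Step 1. Fix $j \in \mathcal J$ and recall that there exists $x_j \in B_j$ such that $f^\star(x_j) - f^\sharp(x_j) > c_1 M^{-\beta}$. Define $\mathcal I_j^\star = \{ i \in \mathcal I : f^{(i)}(x_j) = f^\star(x_j) \}$ and $\mathcal I_j^0 = \mathcal I \setminus \mathcal I_j^\star = \{ i \in \mathcal I : f^\star(x_j) - f^{(i)}(x_j) > c_1 M^{-\beta} \}$. We call $\mathcal I_j^\star$ the set of (almost) optimal arms over $B_j$ and $\mathcal I_j^0$ the set of suboptimal arms over $B_j$. Note that $\mathcal I_j^0 \neq \emptyset$ for any $j \in \mathcal J$. The smoothness condition implies that for any $i \in \mathcal I_j^0$, $x \in B_j$,

(4.11)   $f^\star(x) - f^{(i)}(x) > c_1 M^{-\beta} - 2L\|x - x_j\|^{\beta} \ge M^{-\beta}$.

Therefore, $f^\star - f^\sharp > 0$ on $B_j$. Moreover, if an arm $i \in \mathcal I_j^\star$ is not the best arm at some $x \neq x_j$, then necessarily $0 < f^\star(x) - f^\sharp(x) \le f^\star(x) - f^{(i)}(x) \le c_1 M^{-\beta}$. So for any $x \in B_j$ and any $i \in \mathcal I_j^\star$, it holds that either $f^\star(x) = f^{(i)}(x)$ or $f^\star(x) - f^{(i)}(x) \le c_1 M^{-\beta}$. It yields

(4.12)   $f^\star(x) - f^{(i)}(x) \le c_1 M^{-\beta}\, \mathbb 1\{ 0 < f^\star(x) - f^\sharp(x) \le c_1 M^{-\beta} \}$.

Thus, for any optimal arm $i \in \mathcal I_j^\star$, the reward functions averaged over $B_j$ satisfy $\bar f_j^\star - \bar f_j^{(i)} \le c_1 M^{-\beta} q_j$, where

$q_j := P_X\{ 0 < f^\star - f^\sharp \le c_1 M^{-\beta} \mid X \in B_j \}$.

For any suboptimal arm $i \in \mathcal I_j^0$, (4.11) implies that $\Delta_j^{(i)} := \bar f_j^\star - \bar f_j^{(i)} > M^{-\beta}$.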

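Assume now without loss of generality that the average gaps $\Delta_j^{(i)}$ are ordered in such a way that $\Delta_j^{(1)} \ge \ldots \ge \Delta_j^{(K)}$. Define

$K_0 := \arg\min_{i \in \mathcal I_j^0} \Delta_j^{(i)}$ and $\underline\Delta_j := \Delta_j^{(K_0)}$

and observe that if $i \in \mathcal I$ is such that $\Delta_j^{(i)} < \underline\Delta_j$, then $i \in \mathcal I_j^\star$. Therefore, it follows from (4.12) that $\Delta_j^{(i)} \le c_1 M^{-\beta} q_j$ for such $i$. Recall that the BSE runs a SE policy on $B_j$ initialized at time $t_0 = 1$ so that it has a random horizon $N_j(n) = \sum_{t=1}^{n} \mathbb 1(X_t \in B_j)$. Since the density of $P_X$ is bounded above, we get $\mathbb E[N_j(n)] \le \bar c\, n M^{-d}$. Applying Corollary 2.3 with $K_0$ as above and $\gamma = 1$, we find that there exists a constant $c_3 > 0$ such that, for any $j \in \mathcal J$,

$\mathbb E R_j(\hat\pi) \le 336(1 + \bar c)\, \frac{K}{\underline\Delta_j} \log\left( nM^{-d} \underline\Delta_j^2 \right) + \bar c\, c_1 n M^{-d-\beta} q_j$

(4.13)   $\le c_3 \left[ \frac{K}{\underline\Delta_j} \log\left( nM^{-d} \underline\Delta_j^2 \right) + n M^{-d-\beta} q_j \right]$.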

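Step 2. We now use the margin condition to provide lower bounds on $\underline\Delta_j$ for each $j \in \mathcal J$. Assume without loss of generality that the indexing of the bins is such that $\mathcal J = \{1, \ldots, j_1\}$ and that the gaps are ordered $0 < \underline\Delta_1 \le \underline\Delta_2 \le \ldots \le \underline\Delta_{j_1}$. For any $j \in \mathcal J$, from the definition of $\underline\Delta_j$, there exists a suboptimal arm $i \in \mathcal I_j^0$ such that $\underline\Delta_j = \bar f_j^\star - \bar f_j^{(i)}$. But since the function $f^\star - f^{(i)}$ satisfies the smoothness condition with parameters $(\beta, 2L)$, we find that if $\underline\Delta_j \le \delta$ for some $\delta > 0$, then

$0 < f^\star(x) - f^{(i)}(x) \le \delta + 2Ld^{\beta/2} M^{-\beta}$, $\forall\, x \in B_j$.

Together with the fact that $f^\star - f^\sharp > 0$ over $B_j$ for any $j \in \mathcal J$ (see Step 1 above), it yields

$P_X\left[ 0 < f^\star - f^\sharp \le \underline\Delta_j + 2Ld^{\beta/2} M^{-\beta} \right] \ge \sum_{k=1}^{j_1} p_k\, \mathbb 1(0 < \underline\Delta_k \le \underline\Delta_j) \ge \frac{\underline c\, j}{M^d}$,

where we used the fact that $p_k = P_X(B_k) \ge \underline c / M^d$. Define $j_2 \in \mathcal J$ to be the largest integer such that $\underline\Delta_{j_2} \le \delta_0 / c_1$. Since for any $j \in \mathcal J$, we have $\underline\Delta_j > M^{-\beta}$, the margin condition yields for any $j \in \{1, \ldots, j_2\}$ that

$P_X\left[ 0 < f^\star - f^\sharp \le \underline\Delta_j + 2Ld^{\beta/2} M^{-\beta} \right] \le C_\delta (c_1 \underline\Delta_j)^{\alpha}$,

where we have used the fact that $\underline\Delta_j + 2Ld^{\beta/2} M^{-\beta} \le c_1 \underline\Delta_j \le \delta_0$, for any $j \in \{1, \ldots, j_2\}$. The previous two inequalities, together with the fact that $\underline\Delta_j > M^{-\beta}$ for any $j \in \mathcal J$, yield

$\underline\Delta_j \ge c_4 \left( \frac{j}{M^d} \right)^{1/\alpha} \vee M^{-\beta} =: \gamma_j$, $\forall\, j \in \{1, \ldots, j_2\}$.

Therefore, using the fact that $\underline\Delta_j \ge \delta_0 / c_1$ for $j \ge j_2$, we get from (4.13) that

(4.14)   $\sum_{j \in \mathcal J} \mathbb E R_j(\hat\pi) \le c_5 \left[ \sum_{j=1}^{j_2} \frac{K}{\gamma_j} \log\left( nM^{-d} \gamma_j^2 \right) + \sum_{j=j_2+1}^{j_1} K \log(n) + \sum_{j \in \mathcal J} n M^{-d-\beta} q_j \right]$.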



Fourth part: Putting things together.

Combining (4.9), (4.10) and (4.14), we obtain the following bound.

(4.15)   $\mathbb E R_n(\hat\pi) \le c_6 \left[ n M^{-\beta(1+\alpha)} + \sqrt{K n \log(K)}\, M^{\frac{d}{2} - \beta\alpha} + K \sum_{j=1}^{j_2} \frac{\log\left( nM^{-d} \gamma_j^2 \right)}{\gamma_j} + K M^d \log n + n M^{-d-\beta} \sum_{j \in \mathcal J} q_j \right]$.

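We now bound from above the first sum in (4.15) by decomposing it into two terms. From the definition of $\gamma_j$, there exists an integer $j_3$ satisfying $c_7 M^{d - \alpha\beta} \le j_3 \le 2c_7 M^{d - \alpha\beta}$ and such that $\gamma_j = M^{-\beta}$ for $j \le j_3$ and $\gamma_j = c_4 (j M^{-d})^{1/\alpha}$ for $j > j_3$. It holds

(4.16)   $\sum_{j=1}^{j_3} \frac{\log\left( nM^{-d} \gamma_j^2 \right)}{\gamma_j} \le c_8 M^{d + \beta(1-\alpha)} \log\left( \frac{n}{M^{2\beta+d}} \right)$

and

(4.17)   $\sum_{j=j_3+1}^{j_2} \frac{\log\left( nM^{-d} \gamma_j^2 \right)}{\gamma_j} \le c_9 M^d \int_{M^{-\alpha\beta}}^{1} \log\left( \frac{n}{M^d}\, x^{2/\alpha} \right) x^{-1/\alpha}\, dx$.

Since $\alpha < 1$, this integral is bounded by $c_{10} M^{\beta(1-\alpha)} \left( 1 + \log\left( n / M^{2\beta+d} \right) \right)$.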
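The second sum in (4.15) can be bounded as:

(4.18)   $\sum_{j \in \mathcal J} q_j \le \frac{M^d}{\underline c}\, P_X\left\{ 0 < f^\star(X) - f^\sharp(X) \le c_1 M^{-\beta} \right\} \le \frac{C_\delta c_1^{\alpha}}{\underline c}\, M^{d - \beta\alpha}$.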

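Putting together (4.15)-(4.18), we obtain:

$\mathbb E R_n(\hat\pi) \le c_{11} \left[ n M^{-\beta(1+\alpha)} + \sqrt{K n \log(K)}\, M^{\frac{d}{2} - \beta\alpha} + K M^{d + \beta(1-\alpha)} \log\left( \frac{n}{M^{2\beta+d}} \right) + K M^d \log n \right]$,

and the result follows by choosing $M$ as prescribed.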
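To see why the prescribed $M$ balances the dominant terms of the final bound, the following short computation may help. It is a sketch that ignores the integer part, i.e. it assumes $M^{2\beta+d} = n/(K\log K)$ exactly; constants are not tracked.

```latex
% Assumption: M^{2\beta+d} = n / (K \log K), integer part ignored.
\begin{align*}
\sqrt{K n \log K}\; M^{\frac{d}{2}-\beta\alpha}
  &= \sqrt{K n \log K}\; M^{\frac{2\beta+d}{2}}\, M^{-\beta(1+\alpha)}
   = n\, M^{-\beta(1+\alpha)}, \\
K M^{d+\beta(1-\alpha)} \log\!\Big(\frac{n}{M^{2\beta+d}}\Big)
  &= K M^{2\beta+d}\, M^{-\beta(1+\alpha)} \log(K \log K)
   = \frac{\log(K \log K)}{\log K}\; n\, M^{-\beta(1+\alpha)}, \\
n\, M^{-\beta(1+\alpha)}
  &= n \Big(\frac{K \log K}{n}\Big)^{\frac{\beta(1+\alpha)}{2\beta+d}} .
\end{align*}
```

Since $\log(K \log K) \le 2 \log K$ for $K \ge 2$, the first three terms are all of the order stated in Theorem 4.1, with a constant that does not depend on $K$.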



We should point out that the version of the bse described above specifies the
number of bins $M$ as a function of the horizon $n$, while in practice one may not
have foreknowledge of this value. This limitation can be easily circumvented by using
the so-called doubling argument (see, e.g., page 17 in Cesa-Bianchi and Lugosi
(2006)) which consists of "resetting" the game at times $2^k$, $k = 1, 2, \ldots$
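The doubling argument can be sketched as a schedule of epochs. The helper below is hypothetical and only enumerates the epochs $[2^k, 2^{k+1} - 1]$, truncated at the true (unknown in advance) horizon; how the policy is restarted inside each epoch, for instance with the epoch length playing the role of $n$ when choosing $M$, is an assumption left to the caller.

```python
from typing import Iterator, Tuple


def doubling_epochs(n: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for the epochs [2^k, 2^(k+1) - 1], truncated at n.

    The game is reset at every `start`; time indices are 1-based.
    """
    start = 1
    while start <= n:
        end = min(2 * start - 1, n)
        yield start, end
        start *= 2


if __name__ == "__main__":
    for start, end in doubling_epochs(10):
        print(start, end, "epoch length:", end - start + 1)
```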

The reader will note that when $\alpha = 1$ there is a potentially superfluous $\log n$
factor appearing in the upper bound using the same proof. More generally, for
any $\alpha > 1$, it is possible to minimize the expression in (4.15) with respect to $M$,
but the optimal value of $M$ would then depend on the value of $\alpha$. This sheds
some light on a significant limitation of the bse which surfaces in this parameter
regime: for $n$ large enough, it requires the operator to pull each arm at least once
in each bin and therefore to incur an expected regret of at least order $M^d$. In other
words, the BSE splits the space $\mathcal X$ in "too many" bins when $\alpha > 1$. Intuitively
this can be understood as follows. When $\alpha > 1$, the gap function $f^\star(x) - f^\sharp(x)$
is bounded away from zero on a large subset of $\mathcal X$. Hence there is no need to
carefully estimate it since the optimal arm is the same across the region. As a
result, one could use larger bins in such regions, reducing the overall number of
bins and therefore removing the extra logarithmic term alluded to above.

5. ADAPTIVELY BINNED SUCCESSIVE ELIMINATION 

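We need the following definitions. Assume that $n \ge K \log(K)$ and let $k_0$ be the smallest integer such that

(5.19)   $2^{-k_0} \le \left( \frac{K \log(K)}{n} \right)^{\frac{1}{2\beta+d}}$.

For any bin $B \in \bigcup_{k \ge 0} \mathcal B_{2^k}$, let $\ell_B$ be the smallest integer such that

(5.20)   $U(\ell_B, n|B|^d) \le 2c_0 |B|^{\beta}$,

where $U$ is defined in (2.1) and $c_0 = 2Ld^{\beta/2}$. This definition implies that

(5.21)   $\ell_B \le C_\ell\, |B|^{-2\beta} \log\left( n |B|^{(2\beta+d)} \right)$,

for some positive constant $C_\ell$.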
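The two quantities defined in (5.19) and (5.20) can be computed by direct search. The sketch below is illustrative: the function `U` is an assumed stand-in for the confidence radius (2.1), which is defined outside this section, and the names `depth_k0` and `rounds_before_burst` are hypothetical. Here `side` denotes $|B|$, the side length of the bin, and the sketch assumes $n \ge K \log K$ as in the text.

```python
import math


def U(tau: int, T: float) -> float:
    # ASSUMED form of (2.1): 2 * sqrt((2 / tau) * max(log(T / tau), 1)).
    return 2.0 * math.sqrt(2.0 / tau * max(math.log(T / tau), 1.0))


def depth_k0(n: int, K: int, beta: float, d: int) -> int:
    """Smallest nonnegative integer k0 with 2^(-k0) <= (K log K / n)^(1/(2 beta + d))."""
    target = (K * math.log(K) / n) ** (1.0 / (2 * beta + d))
    k0 = 0
    while 2.0 ** (-k0) > target:
        k0 += 1
    return k0


def rounds_before_burst(side: float, n: int, L: float, beta: float, d: int) -> int:
    """Smallest integer ell with U(ell, n * side^d) <= 2 * c0 * side^beta, as in (5.20)."""
    c0 = 2.0 * L * d ** (beta / 2.0)
    threshold = 2.0 * c0 * side ** beta
    ell = 1
    while U(ell, n * side ** d) > threshold:
        ell += 1
    return ell


if __name__ == "__main__":
    print(depth_k0(10**6, 2, 1.0, 1), rounds_before_burst(0.25, 10**6, 1.0, 1.0, 1))
```

The search for $\ell_B$ terminates because the assumed radius decreases to zero as the number of rounds grows.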

The ABSE policy operates in a manner similar to the bse except that instead
of fixing a partition $\mathcal B_M$, it relies on an adaptive partition that is refined over
time. This partition is better understood using the notion of rooted tree.

Let $\mathcal T^\star$ be a tree with root $\mathcal X$ and maximum depth $k_0$. A node $B$ of $\mathcal T^\star$ with
depth $k = 0, \ldots, k_0 - 1$ is a set from the regular partition $\mathcal B_{2^k}$. The children of
node $B \in \mathcal B_{2^k}$ are given by burst($B$), defined to be the collection of $2^d$ bins in
$\mathcal B_{2^{k+1}}$ that forms a partition of $B$.
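The burst operation is easy to make explicit. In the sketch below a dyadic bin is represented, by assumption, as a pair (lower corner, side length); this representation and the function name are illustrative choices, not notation from the paper.

```python
import itertools
from typing import List, Tuple

Bin = Tuple[Tuple[float, ...], float]  # (lower corner, side length)


def burst(corner: Tuple[float, ...], side: float) -> List[Bin]:
    """Return the 2^d children of a dyadic hypercube, each of half the side length."""
    half = side / 2.0
    children = []
    for shift in itertools.product((0, 1), repeat=len(corner)):
        child_corner = tuple(c + s * half for c, s in zip(corner, shift))
        children.append((child_corner, half))
    return children


if __name__ == "__main__":
    for child in burst((0.0, 0.0), 1.0):
        print(child)
```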

Note that the set $\mathcal L$ of leaves of each subtree $\mathcal T$ of $\mathcal T^\star$ forms a partition of $\mathcal X$. The ABSE policy constructs a sequence of partitions $\mathcal L_1, \ldots, \mathcal L_n$ that are leaves of subtrees of $\mathcal T^\star$. At a given time $t = 1, \ldots, n$, we refer to the elements of the current partition $\mathcal L_t$ as live bins. The sequence of partitions is nested in the sense that if $B \in \mathcal L_t$, then either $B \in \mathcal L_{t+1}$ or burst($B$) $\subset \mathcal L_{t+1}$. The sequence $\mathcal L_1, \ldots, \mathcal L_n$ is constructed as follows.

In the initialization step, set $\mathcal L_0 = \emptyset$, $\mathcal L_1 = \{\mathcal X\}$, and the initial set of arms $\mathcal I_{\mathcal X} = \{1, \ldots, K\}$. Let $t \le n$ be a time such that $\mathcal L_t \neq \mathcal L_{t-1}$ and let $\mathcal B_t$ be the collection of sets $B$ such that $B \in \mathcal L_t \setminus \mathcal L_{t-1}$. We say that the bins $B \in \mathcal B_t$ are born at time $t$. For each set $B \in \mathcal B_t$, assume that we are given a set of active arms $\mathcal I_B$. Note that $t = 1$ is such a time with $\mathcal B_1 = \{\mathcal X\}$ and active arms $\mathcal I_{\mathcal X}$. For each born bin $B \in \mathcal B_t$, we run a SE policy $\hat\pi_B$ initialized at time $t$ with initial set of arms $\mathcal I_B$ and parameters $T_B = n|B|^d$, $\gamma = 2$. Such a policy is defined in section 3.3. Let $t(B)$ denote the time at which $\hat\pi_B$ has reached $\ell_B$ rounds and record the set of arms $\mathcal S_B$ that are active at this time. At time $t(B) + 1$, we replace the node $B$ by its children burst($B$) in the current partition. Namely, $\mathcal L_{t(B)+1} = (\mathcal L_{t(B)} \setminus B) \cup$ burst($B$). Moreover, to each bin $B' \in$ burst($B$), we assign the set of active arms $\mathcal I_{B'} = \mathcal S_B$. This procedure is repeated until the horizon $n$ is reached.

The intuition behind this policy is the following. The parameters of the SE
policy $\hat\pi_B$ run at the birth of bin $B$ are chosen exactly such that arms $i$ with
average gap $|\bar f_B^\star - \bar f_B^{(i)}| \ge C|B|^{\beta}$ are eliminated by the end of $\ell_B$ rounds with
high probability. The smoothness condition ensures that these eliminated arms
satisfy $f^\star(x) > f^{(i)}(x)$ for all $x \in B$ so that such arms are uniformly suboptimal
on bin $B$. Among the kept arms, none is uniformly better than another so bin $B$
is burst and the process is repeated on the children of $B$ where other arms may
be uniformly suboptimal. The formal definition of the abse is given in Policy 3.

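Policy 3 Adaptively Binned Successive Elimination (abse)

Input: Set of arms $\mathcal I = \{1, \ldots, K\}$. Parameters $n$, $c_0 = 2Ld^{\beta/2}$.
Output: $\hat\pi_1, \ldots, \hat\pi_n \in \mathcal I$.
$t \leftarrow 0$, $k \leftarrow 0$, $\mathcal L \leftarrow \{\mathcal X\}$, $\mathcal S_{\mathcal X} \leftarrow \mathcal I$
Initialize a SE policy $\hat\pi_{\mathcal X}$ with parameters $T = n$, $\gamma = 2$ and arms $\mathcal I = \mathcal S_{\mathcal X}$.
$t_{\mathcal X} \leftarrow 0$.
for $t = 1, \ldots, n$ do
    $B \leftarrow \mathcal L(X_t)$.
    $t_B \leftarrow t_B + 1$.    /count times $X_t \in B$/
    $\hat\pi_t \leftarrow \hat\pi_{B, t_B}$ (observe $Y_t^{(\hat\pi_t)}$).    /choose arm from SE policy $\hat\pi_B$/
    $\tau_B \leftarrow \tau_{B, t_B}$    /update number of rounds for $\hat\pi_B$/
    if $\tau_B \ge \ell_B$ and $|B| \ge 2^{-k_0+1}$ and $|\mathcal S_{B, t_B}| \ge 2$ then    /conditions to burst($B$)/
        $\mathcal S \leftarrow \mathcal S_{B, t_B}$    /assign remaining arms as initial arms/
        for $B' \in$ burst($B$) do
            Initialize a SE policy $\hat\pi_{B'}$ with parameters $T = n|B'|^d$, $\gamma = 2$ and arms $\mathcal I = \mathcal S$.
            $t_{B'} \leftarrow 0$.    /set time to 0 for each new SE policy/
        end for
        $\mathcal L \leftarrow \mathcal L \setminus B$    /remove $B$ from current partition/
        $\mathcal L \leftarrow \mathcal L \cup$ burst($B$)    /add $B$'s children to current partition/
    end if
end for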
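Policy 3 can be sketched in Python along the following lines. This is a simplified illustration, not the authors' implementation. As before, `U` is an assumed stand-in for the confidence radius (2.1) and the elimination rule inside `SuccessiveElimination` is assumed from the description of SE. Bins are represented as (lower corner, side length) pairs with half-open cells, covariates are assumed to lie in $[0,1)^d$, and all function and parameter names (`run_abse`, `draw_x`, `draw_y`, and so on) are hypothetical.

```python
import itertools
import math
import random


def U(tau, T):
    # ASSUMED stand-in for (2.1): 2 * sqrt((2 / tau) * max(log(T / tau), 1)).
    return 2.0 * math.sqrt(2.0 / tau * max(math.log(T / tau), 1.0))


class SuccessiveElimination:
    def __init__(self, arms, T, gamma):
        self.active, self.T, self.gamma = list(arms), T, gamma
        self.sums = {i: 0.0 for i in self.active}
        self.tau, self.pos = 0, 0  # completed rounds, position in the round

    def select(self):
        return self.active[self.pos]

    def update(self, arm, reward):
        self.sums[arm] += reward
        self.pos += 1
        if self.pos == len(self.active):  # a full round has been played
            self.pos, self.tau = 0, self.tau + 1
            means = {i: self.sums[i] / self.tau for i in self.active}
            best = max(means.values())
            radius = self.gamma * U(self.tau, self.T)
            self.active = [i for i in self.active if best - means[i] <= radius]


def burst(corner, side):
    half = side / 2.0
    return [(tuple(c + s * half for c, s in zip(corner, shift)), half)
            for shift in itertools.product((0, 1), repeat=len(corner))]


def rounds_before_burst(side, n, c0, beta, d):
    # Smallest ell with U(ell, n * side^d) <= 2 * c0 * side^beta, as in (5.20).
    ell = 1
    while U(ell, n * side ** d) > 2.0 * c0 * side ** beta:
        ell += 1
    return ell


def contains(bin_, x):
    corner, side = bin_
    return all(c <= xi < c + side for c, xi in zip(corner, x))


def run_abse(n, K, d, k0, c0, beta, draw_x, draw_y, seed=0):
    """Run the sketch for n steps and return the final live bins with their policies."""
    rng = random.Random(seed)
    root = ((0.0,) * d, 1.0)
    live = {root: SuccessiveElimination(range(K), n, 2.0)}
    ell = {root: rounds_before_burst(1.0, n, c0, beta, d)}
    for _ in range(n):
        x = draw_x(rng)
        B = next(b for b in live if contains(b, x))
        policy = live[B]
        arm = policy.select()
        policy.update(arm, draw_y(rng, arm, x))
        corner, side = B
        if (policy.tau >= ell[B] and side >= 2.0 ** (-k0 + 1)
                and len(policy.active) >= 2):
            survivors = list(policy.active)  # remaining arms seed the children
            del live[B]
            for child in burst(corner, side):
                live[child] = SuccessiveElimination(survivors, n * child[1] ** d, 2.0)
                ell[child] = rounds_before_burst(child[1], n, c0, beta, d)
    return live
```

The linear scan used to locate the live bin containing $X_t$ keeps the sketch short; a tree descent from the root would be the natural choice for large partitions.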



The ABSE policy, whose pseudo-code is described in Policy 3, satisfies the following theorem.

Theorem 5.1. Fix $\beta \in (0,1]$, $L > 0$, $\alpha > 0$, assume that $n \ge K \log(K)$ and consider a problem in $\mathcal M^K_{\mathcal X}(\alpha, \beta, L)$. If $\alpha < \infty$, then the abse policy $\hat\pi$ has an expected regret at time $n$ bounded by

$\mathbb E R_n(\hat\pi) \le C n \left( \frac{K \log K}{n} \right)^{\frac{\beta(\alpha+1)}{2\beta+d}}$,

where $C > 0$ is a positive constant that does not depend on $K$. If $\alpha = \infty$, then
$\mathbb E R_n(\hat\pi) \le C K \log(n)$.

Note that the bounds given in Theorem 5.1 are optimal in a minimax sense
when $K = 2$. Indeed, the lower bounds of Audibert and Tsybakov (2007) and Rigollet and Zeevi
(2010) imply that the bound on the expected regret cannot be improved as a function
of $n$ except for a constant multiplicative term. Note that the lower bound
proved in Audibert and Tsybakov (2007) implies that any policy that receives
information from both arms at each round has a regret bound at least as large as
the one from Theorem 5.1, up to a multiplicative constant. As a result, there is
no price to pay for being in a partial information setup and one could say that
the problem of nonparametric estimation dominates the problem associated with
making decisions sequentially.

Note also that when $\alpha = \infty$, Proposition 3.1 implies that there exists a unique
optimal arm over $\mathcal X$ and that all other arms have reward bounded away from
that of the optimal arm. As a result, given this information, one could operate
as if the problem was static by simply discarding the covariates. Theorem 5.1
implies that in this case, one recovers the traditional regret bound of the static
case without the knowledge that $\alpha = \infty$.

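Proof. We first consider the case where $\alpha < \infty$, which implies that $\alpha\beta \le d$; see Proposition 3.1.

We keep track of positive constants by numbering them $c_1, c_2, \ldots$ On each newly created bin $B$, a new SE policy is initialized and we denote by $Y_{B,1}^{(i)}, Y_{B,2}^{(i)}, \ldots$ the rewards obtained by successive pulls of a remaining arm $i$. Their average after $\tau$ rounds/pulls is denoted by

$\bar Y_{B,\tau}^{(i)} := \frac{1}{\tau} \sum_{s=1}^{\tau} Y_{B,s}^{(i)}$.

For any integer $s$, define $\varepsilon_{B,s} = 2U(s, n|B|^d)$, where $U$ is defined in (2.1). For any $B \in \mathcal T^\star \setminus \{\mathcal X\}$, define the unique parent of $B$ by

$p(B) := \{ B' \in \mathcal T^\star : B \in \text{burst}(B') \}$

and $p(\mathcal X) = \emptyset$. Moreover, let $p^1(B) = p(B)$ and for any $k \ge 2$ define recursively $p^k(B) = p(p^{k-1}(B))$. Then the set of ancestors of any $B \in \mathcal T^\star$ is denoted by $\mathcal P(B)$ and defined by

$\mathcal P(B) = \{ B' \in \mathcal T^\star : B' = p^k(B) \text{ for some } k \ge 1 \}$.

Denote by $r_n^{\mathrm{live}}(B)$ the regret generated by the abse policy $\hat\pi$ when covariate $X_t$ fell in a live bin $B \in \mathcal L_t$, where we recall that $\mathcal L_t$ denotes the current partition at time $t$. It is defined by

$r_n^{\mathrm{live}}(B) = \sum_{t=1}^{n} \left( f^\star(X_t) - f^{(\hat\pi_t(X_t))}(X_t) \right) \mathbb 1(X_t \in B)\, \mathbb 1(B \in \mathcal L_t)$.

We also define $\mathcal B_t := \bigcup_{s \le t} \mathcal L_s$ to be the set of bins that were born at some time $s \le t$. We denote by $r_n^{\mathrm{born}}(B)$ the regret generated when covariate $X_t$ fell in such a bin. It is defined by

$r_n^{\mathrm{born}}(B) = \sum_{t=1}^{n} \left( f^\star(X_t) - f^{(\hat\pi_t(X_t))}(X_t) \right) \mathbb 1(X_t \in B)\, \mathbb 1(B \in \mathcal B_t)$.

Observe that if we define $r_n := r_n^{\mathrm{born}}(\mathcal X)$, we have $\mathbb E R_n(\hat\pi) = \mathbb E r_n$ since $\mathcal X \in \mathcal B_t$ and $X_t \in \mathcal X$ for all $t$. Note that for any $B \in \mathcal T^\star$,

(5.22)   $r_n^{\mathrm{born}}(B) = r_n^{\mathrm{live}}(B) + \sum_{B' \in \text{burst}(B)} r_n^{\mathrm{born}}(B')$.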
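Denote by $\hat{\mathcal I}_B = \mathcal S_{B, t_B}$ the set of arms left active by the SE policy $\hat\pi_B$ on $B$ at the end of $\ell_B$ rounds. Moreover, define the following reference sets of arms:

$\underline{\mathcal I}_B := \left\{ i \in \{1, \ldots, K\} : \sup_{x \in B} f^\star(x) - f^{(i)}(x) \le c_0 |B|^{\beta} \right\}$,

$\overline{\mathcal I}_B := \left\{ i \in \{1, \ldots, K\} : \sup_{x \in B} f^\star(x) - f^{(i)}(x) \le 8c_0 |B|^{\beta} \right\}$.

Define the event $\mathcal A_B := \{ \underline{\mathcal I}_B \subseteq \hat{\mathcal I}_B \subseteq \overline{\mathcal I}_B \}$ on which the remaining arms have a gap of the correct order and observe that (5.22) implies that

$r_n^{\mathrm{born}}(B) = r_n^{\mathrm{born}}(B)\, \mathbb 1(\mathcal A_B^c) + r_n^{\mathrm{live}}(B)\, \mathbb 1(\mathcal A_B) + \sum_{B' \in \text{burst}(B)} r_n^{\mathrm{born}}(B')\, \mathbb 1(\mathcal A_B)$.

Let $\mathcal L^\star$ denote the set of leaves of $\mathcal T^\star$, that is the set of bins $B$ such that $|B| = 2^{-k_0}$. As a result, the quantity we are interested in can be decomposed as $r_n = r_n(\mathcal T^\star \setminus \mathcal L^\star) + r_n(\mathcal L^\star)$ where

$r_n(\mathcal T^\star \setminus \mathcal L^\star) := \sum_{B \in \mathcal T^\star \setminus \mathcal L^\star} \left( r_n^{\mathrm{born}}(B)\, \mathbb 1(\mathcal A_B^c) + r_n^{\mathrm{live}}(B)\, \mathbb 1(\mathcal A_B) \right) \prod_{B' \in \mathcal P(B)} \mathbb 1(\mathcal A_{B'})$

is the regret on the non-terminal nodes and

$r_n(\mathcal L^\star) := \sum_{B \in \mathcal L^\star} r_n^{\mathrm{born}}(B) \prod_{B' \in \mathcal P(B)} \mathbb 1(\mathcal A_{B'}) = \sum_{B \in \mathcal L^\star} r_n^{\mathrm{live}}(B) \prod_{B' \in \mathcal P(B)} \mathbb 1(\mathcal A_{B'})$

is the regret on the leaves. Our proof relies on the events $\mathcal G_B := \bigcap_{B' \in \mathcal P(B)} \mathcal A_{B'}$.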

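First part: control of the regret on the non-terminal nodes

Fix $B \in \mathcal T^\star \setminus \mathcal L^\star$. On $\mathcal G_B$ we have $\hat{\mathcal I}_{p(B)} \subseteq \overline{\mathcal I}_{p(B)}$ so that any active arm $i \in \hat{\mathcal I}_{p(B)}$ satisfies $\sup_{x \in p(B)} |f^\star(x) - f^{(i)}(x)| \le 8c_0 |p(B)|^{\beta}$. Defining $c_1 := 2^{3+\beta} c_0$, it yields

$\mathbb E\left[ r_n^{\mathrm{live}}(B)\, \mathbb 1(\mathcal G_B \cap \mathcal A_B) \right] \le c_1 K \ell_B |B|^{\beta} q_B$,

where $q_B = P_X\left( 0 < f^\star - f^\sharp \le c_1 |B|^{\beta} \mid X \in B \right)$. We can always assume that $n$ is greater than $n_0 \in \mathbb N$, the integer defined by

$n_0 = \left\lceil K \log(K) \left( \frac{c_1}{\delta_0} \right)^{\frac{2\beta+d}{\beta}} \right\rceil$, so that $c_1 2^{-k_0 \beta} \le \delta_0$,

and let $k_1 \le k_0$ be the smallest integer such that $c_1 2^{-k_1 \beta} \le \delta_0$. Indeed, if $n \le n_0$, the result is true with a constant large enough.

Applying the same argument as in (4.18) yields the existence of $c_2 > 0$ such that, for any $k \in \{0, \ldots, k_0\}$,

$\sum_{|B| = 2^{-k}} q_B \le c_2\, 2^{k(d - \beta\alpha)}$.

Indeed, for $k \ge k_1$ one can define $c_2 = C_\delta c_1^{\alpha} / \underline c$ and the same equation holds with $c_2 = 2^{d k_1}$ if $k \le k_1$. Summing over all depths $k \le k_0 - 1$, we obtain

(5.23)   $\mathbb E\left[ \sum_{B \in \mathcal T^\star \setminus \mathcal L^\star} r_n^{\mathrm{live}}(B)\, \mathbb 1(\mathcal G_B \cap \mathcal A_B) \right] \le c_1 c_2 C_\ell K \sum_{k=0}^{k_0 - 1} 2^{k(d + \beta - \alpha\beta)} \log\left( n 2^{-(2\beta+d)k} \right)$.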

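On the other hand, for every bin $B \in \mathcal T^\star \setminus \mathcal L^\star$, one also has

(5.24)   $\mathbb E\left[ r_n^{\mathrm{born}}(B)\, \mathbb 1(\mathcal G_B \cap \mathcal A_B^c) \right] \le c_1 n |B|^{\beta} q_B\, P_X(B)\, \mathbb P(\mathcal G_B \cap \mathcal A_B^c)$.

It remains to control the probability of $\mathcal G_B \cap \mathcal A_B^c$; we define $\mathbb P^{\mathcal G_B}(\cdot) = \mathbb P(\,\cdot \cap \mathcal G_B)$. On $\mathcal G_B$, the event $\mathcal A_B^c$ can occur in two ways:

(i) By eliminating an arm $i \in \underline{\mathcal I}_B$ at the end of the at most $\ell_B$ rounds played on bin $B$. These arms satisfy $\sup_{x \in B} f^\star(x) - f^{(i)}(x) \le c_0 |B|^{\beta}$; this event is denoted by $\mathcal D_B^1$.

(ii) By not eliminating an arm $i \notin \overline{\mathcal I}_B$ within the at most $\ell_B$ rounds played on bin $B$. These arms satisfy $\sup_{x \in B} f^\star(x) - f^{(i)}(x) \ge 8c_0 |B|^{\beta}$; this event is denoted by $\mathcal D_B^2$.

We use the following decomposition

(5.25)   $\mathbb P^{\mathcal G_B}(\mathcal A_B^c) = \mathbb P^{\mathcal G_B}(\mathcal D_B^1) + \mathbb P^{\mathcal G_B}\left( \mathcal D_B^2 \cap (\mathcal D_B^1)^c \right)$.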

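We first control the probability of making error (i). Note that for any $s \le \ell_B$ and any arms $i \in \underline{\mathcal I}_B$, $i' \in \hat{\mathcal I}_{p(B)}$, it holds

$\bar f_B^{(i')} - \bar f_B^{(i)} \le \bar f_B^\star - \bar f_B^{(i)} \le c_0 |B|^{\beta} \le \frac{\varepsilon_{B,s}}{2}$.

Therefore, if an arm $i \in \underline{\mathcal I}_B$ is eliminated, that is if there exists $i' \in \hat{\mathcal I}_{p(B)}$ such that $\bar Y_{B,s}^{(i')} - \bar Y_{B,s}^{(i)} > \varepsilon_{B,s}$ for some $s \le \ell_B$, then either $\bar f_B^{(i)}$ or $\bar f_B^{(i')}$ does not belong to its respective confidence interval

$\left[ \bar Y_{B,s}^{(i)} \pm \frac{\varepsilon_{B,s}}{4} \right]$ or $\left[ \bar Y_{B,s}^{(i')} \pm \frac{\varepsilon_{B,s}}{4} \right]$

for some $s \le \ell_B$. Therefore,

(5.26)   $\mathbb P^{\mathcal G_B}(\mathcal D_B^1) \le \mathbb P\left\{ \exists\, s \le \ell_B;\ \exists\, i \in \hat{\mathcal I}_{p(B)};\ |\bar Y_{B,s}^{(i)} - \bar f_B^{(i)}| \ge \frac{\varepsilon_{B,s}}{4} \right\} \le \frac{2K\ell_B}{n|B|^d}$,

where in the second inequality, we used Lemma A.1.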

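Next, we treat error (ii). For any $i \notin \overline{\mathcal I}_B$, there exists $x^{(i)}$ such that $f^\star(x^{(i)}) - f^{(i)}(x^{(i)}) > 8c_0 |B|^{\beta}$. Let $\iota = \iota(i) \in \mathcal I$ be any arm such that $f^\star(x^{(i)}) = f^{(\iota)}(x^{(i)})$; the smoothness condition implies that

(5.27)   $\bar f_B^{(\iota)} \ge f^{(\iota)}(x^{(i)}) - c_0 |B|^{\beta} > f^{(i)}(x^{(i)}) + 7c_0 |B|^{\beta} \ge \bar f_B^{(i)} + 6c_0 |B|^{\beta}$.

On the event $(\mathcal D_B^1)^c$, no arm in $\underline{\mathcal I}_B$, and in particular none of the arms $\iota \in \hat{\mathcal I}_{p(B)} \cap \underline{\mathcal I}_B$, has been eliminated until round $\ell_B$. Therefore, the event $\mathcal D_B^2 \cap (\mathcal D_B^1)^c$ occurs if there exists $i \notin \overline{\mathcal I}_B$ such that $\bar Y_{B,\ell_B}^{(\iota)} - \bar Y_{B,\ell_B}^{(i)} \le \varepsilon_{B,\ell_B}$. In view of (5.27) and (5.20), it implies that there exists $i \in \hat{\mathcal I}_{p(B)}$ such that

$|\bar Y_{B,\ell_B}^{(i)} - \bar f_B^{(i)}| \ge c_0 |B|^{\beta} \ge \frac{\varepsilon_{B,\ell_B}}{4}$.

Hence, the probability of error (ii) can be bounded by

(5.28)   $\mathbb P^{\mathcal G_B}\left( \mathcal D_B^2 \cap (\mathcal D_B^1)^c \right) \le \mathbb P\left\{ \exists\, i \in \hat{\mathcal I}_{p(B)} : |\bar Y_{B,\ell_B}^{(i)} - \bar f_B^{(i)}| \ge \frac{\varepsilon_{B,\ell_B}}{4} \right\} \le \frac{2K\ell_B}{n|B|^d}$,

where the second inequality is a consequence of the Hoeffding-Azuma inequality (A.1).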

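Putting together (5.25), (5.26), (5.28) and (5.21), we get

$\mathbb P^{\mathcal G_B}(\mathcal A_B^c) \le \frac{4K\ell_B}{n|B|^d} \le 4C_\ell\, \frac{K}{n}\, |B|^{-(2\beta+d)} \log\left( n|B|^{(2\beta+d)} \right)$.

Together with (5.24), it yields that the expected regret on any $B \in \mathcal T^\star \setminus \mathcal L^\star$ is bounded by

$\mathbb E\left[ r_n^{\mathrm{born}}(B)\, \mathbb 1(\mathcal G_B \cap \mathcal A_B^c) \right] \le c_3 K |B|^{-(\beta+d)} \log\left( n|B|^{(2\beta+d)} \right) q_B\, P_X(B)$.

If $k$ is an integer such that $c_1 2^{-k\beta} > \delta_0$, then any bin $B$ such that $|B| = 2^{-k}$ satisfies $\mathbb E\left[ r_n^{\mathrm{born}}(B)\, \mathbb 1(\mathcal G_B \cap \mathcal A_B^c) \right] \le c_4 K \log n$. If $k$ is an integer such that $c_1 2^{-k\beta} \le \delta_0$, then the above display together with the margin condition yield

$\mathbb E\left[ \sum_{|B| = 2^{-k}} r_n^{\mathrm{born}}(B)\, \mathbb 1(\mathcal G_B \cap \mathcal A_B^c) \right] \le c_5 K\, 2^{k(\beta + d - \alpha\beta)} \log\left( n 2^{-k(2\beta+d)} \right)$.

Summing over all depths $k = 0, \ldots, k_0 - 1$ and using (5.23), we obtain

(5.29)   $\mathbb E\left[ r_n(\mathcal T^\star \setminus \mathcal L^\star) \right] \le c_6 K \sum_{k=0}^{k_0 - 1} 2^{k(\beta + d - \alpha\beta)} \log\left( n 2^{-k(2\beta+d)} \right)$.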

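We now compute an upper bound on the right-hand side of the above inequality. Fix $k = 0, \ldots, k_0$ and define

$S_k = \sum_{j=0}^{k-1} 2^{j(d + \beta - \beta\alpha)} = \frac{2^{k(d + \beta - \beta\alpha)} - 1}{2^{d + \beta - \beta\alpha} - 1}$.

Observe that

$2^{k(d + \beta - \beta\alpha)} = c_7 S_k + 1$,

where $c_7 := 2^{d + \beta - \beta\alpha} - 1$. Therefore, (5.29) can be rewritten as:

$\mathbb E\left[ r_n(\mathcal T^\star \setminus \mathcal L^\star) \right] \le c_6 K \left[ \sum_{k=1}^{k_0 - 1} (S_k - S_{k-1}) \log\left( n [c_7 S_k + 1]^{-\frac{d + 2\beta}{d + \beta - \beta\alpha}} \right) + \log n \right]$

$\le c_6 K \left[ \int_0^{S_{k_0}} \log\left( n [c_7 x + 1]^{-\frac{d + 2\beta}{d + \beta - \beta\alpha}} \right) dx + \log n \right]$

$\le c_8 K \left[ 2^{k_0(d + \beta - \beta\alpha)} \log\left( n 2^{-k_0(d + 2\beta)} \right) + \log n \right]$

(5.30)   $\le c_9\, n \left( \frac{n}{K \log(K)} \right)^{-\frac{\beta(1+\alpha)}{d + 2\beta}}$,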



imsart-sts ver. 2009/08/13 file: PerRigll_arxiv_rev.tex date: June 12, 2012 



where we used (5.19) in the last inequality and the fact that $\log(n)$ is dominated by $n^{1 - \beta(1+\alpha)/(d + 2\beta)}$ since $\alpha\beta \le d$.

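Second part: control of the regret on the leaves

Recall that the set of leaves $\mathcal L^\star$ is composed of bins $B$ such that $|B| = 2^{-k_0}$. Proceeding in the same way as in (5.24), we find that for any $B \in \mathcal L^\star$, it holds

$\mathbb E\left[ r_n^{\mathrm{live}}(B)\, \mathbb 1(\mathcal G_B) \right] \le c_1 n |B|^{\beta}\, P_X\left( 0 < f^\star - f^\sharp \le c_1 |B|^{\beta},\ X \in B \right)$.

Since $n \ge n_0$, then $c_1 2^{-k_0 \beta} \le \delta_0$ and using the margin assumption, we find

(5.31)   $\sum_{B \in \mathcal L^\star} \mathbb E\left[ r_n^{\mathrm{live}}(B)\, \mathbb 1(\mathcal G_B) \right] \le c_{10}\, n\, 2^{-k_0 \beta(1+\alpha)} \le c_{10}\, n \left( \frac{n}{K \log(K)} \right)^{-\frac{\beta(1+\alpha)}{d + 2\beta}}$,

where we used (5.19) in the second inequality.

The theorem follows by summing (5.30) and (5.31). If $\alpha = +\infty$, then the same proof holds except that $\log(n)$ dominates $2^{k_0(d + \beta - \alpha\beta)} \log\left( n 2^{-k_0(2\beta+d)} \right)$ in Equation (5.30).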



APPENDIX A: TECHNICAL LEMMA

The following lemma is central to our proof of Theorem 2.1. We recall that a process $Z_t$ is a martingale difference sequence if $\mathbb E[Z_{t+1} \mid Z_1, \ldots, Z_t] = 0$. Moreover, if $-1 \le Z_t \le 1$ and if we denote the sequence of averages by $\bar Z_t = \frac{1}{t} \sum_{s=1}^{t} Z_s$, then Hoeffding-Azuma's inequality yields that, for every integer $T \ge 1$,

(A.1)   $\mathbb P\left\{ \bar Z_T \ge \sqrt{\frac{2}{T} \log\left( \frac{1}{\delta} \right)} \right\} \le \delta$.

The following lemma is a generalization of this result:

Lemma A.1. Let $Z_t$ be a martingale difference sequence with $-1 \le Z_t \le 1$. Then, for every $\delta > 0$ and every integer $T \ge 1$,

$\mathbb P\left\{ \exists\, t \le T,\ \bar Z_t \ge 2\sqrt{\frac{2}{t} \log\left( \frac{4}{\delta} \frac{T}{t} \right)} \right\} \le \delta$.

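Proof. Define $\varepsilon_t = 2\sqrt{\frac{2}{t} \log\left( \frac{4}{\delta} \frac{T}{t} \right)}$. We first recall the Hoeffding-Azuma maximal concentration inequality. For every $\eta > 0$ and every integer $t \ge 1$,

$\mathbb P\{ \exists\, s \le t,\ s \bar Z_s \ge \eta \} \le \exp\left( -\frac{\eta^2}{2t} \right)$.

Using a peeling argument, one obtains

$\mathbb P\{ \exists\, t \le T,\ \bar Z_t \ge \varepsilon_t \} \le \sum_{m=0}^{\lfloor \log_2(T) \rfloor} \mathbb P\left\{ \bigcup_{t=2^m}^{2^{m+1}-1} \{ \bar Z_t \ge \varepsilon_t \} \right\}$

$\le \sum_{m=0}^{\lfloor \log_2(T) \rfloor} \mathbb P\left\{ \bigcup_{t=2^m}^{2^{m+1}} \{ \bar Z_t \ge \varepsilon_{2^{m+1}} \} \right\} \le \sum_{m=0}^{\lfloor \log_2(T) \rfloor} \mathbb P\left\{ \bigcup_{t=2^m}^{2^{m+1}} \{ t \bar Z_t \ge 2^m \varepsilon_{2^{m+1}} \} \right\}$

$\le \sum_{m=0}^{\lfloor \log_2(T) \rfloor} \exp\left( -\frac{(2^m \varepsilon_{2^{m+1}})^2}{2^{m+2}} \right) = \sum_{m=0}^{\lfloor \log_2(T) \rfloor} \frac{2^{m+1}}{T} \frac{\delta}{4} \le \delta$.

Hence the result.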
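The last step of the peeling argument can be checked numerically. The sketch below assumes the reading of the proof in which $\varepsilon_t = 2\sqrt{(2/t)\log(4T/(\delta t))}$, the maximal inequality has exponent $-\eta^2/(2t)$, and the sum runs over $m = 0, \ldots, \lfloor \log_2 T \rfloor$; the function names are hypothetical. It verifies that each term equals $\delta 2^{m+1}/(4T)$ and that the terms sum to at most $\delta$.

```python
import math
from typing import List


def epsilon(t: int, T: int, delta: float) -> float:
    """Threshold eps_t = 2 * sqrt((2 / t) * log(4 T / (delta t)))."""
    return 2.0 * math.sqrt(2.0 / t * math.log(4.0 * T / (delta * t)))


def peeling_terms(T: int, delta: float) -> List[float]:
    """Terms exp(-(2^m eps_{2^(m+1)})^2 / 2^(m+2)) for m = 0, ..., floor(log2 T)."""
    terms = []
    for m in range(int(math.floor(math.log2(T))) + 1):
        eta = 2 ** m * epsilon(2 ** (m + 1), T, delta)
        # Maximal inequality applied with t = 2^(m+1): exp(-eta^2 / (2 t)).
        terms.append(math.exp(-eta ** 2 / (2.0 * 2 ** (m + 1))))
    return terms


if __name__ == "__main__":
    for T in (1, 7, 100, 1024):
        print(T, sum(peeling_terms(T, 0.05)))
```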